The document discusses data duplication elimination and the Basic Sorted Neighborhood (BSN) method. It describes how data duplication can cause problems and outlines the BSN method which involves concatenating data, creating keys, sorting records by key, and moving a window through the sorted records to compare neighboring records and identify duplicates. It notes challenges with dirty data and the need for standardization. The time complexity of BSN is analyzed and it is noted that further rules and an equational theory are needed to fully specify the matching inferences.
Lecture 20
1. Data Warehousing
Lecture-20
Data Duplication Elimination & BSN Method
Virtual University of Pakistan
Ahsan Abdullah
Assoc. Prof. & Head
Center for Agro-Informatics Research
www.nu.edu.pk/cairindex.asp
National University of Computers & Emerging Sciences, Islamabad
Email: ahsan1010@yahoo.com
2. Why is data duplicated?
A data warehouse is created from heterogeneous sources, with heterogeneous databases (different schema/representation) of the same entity.
Data coming from outside the organization that owns the DWH can be of even lower quality, i.e. different representations of the same entity, and transcription or typographical errors.
3. Problems due to data duplication
Data duplication can result in costly errors, such as:
• False frequency distributions.
• Incorrect aggregates due to double counting.
• Difficulty for credit card companies in catching fabricated identities.
4. Data Duplication: Non-Unique PK
• Duplicate Identification Numbers
• Multiple Customer Numbers
• Multiple Employee Numbers
Unable to determine customer relationships (CRM):
Name | Phone Number | Cust. No.
M. Ismail Siddiqi | 021.666.1244 | 780701
M. Ismail Siddiqi | 021.666.1244 | 780203
M. Ismail Siddiqi | 021.666.1244 | 780009
Unable to analyze employee benefits trends:
Bonus Date | Name | Department | Emp. No.
Jan. 2000 | Khan Muhammad | 213 (MKT) | 5353536
Dec. 2001 | Khan Muhammad | 567 (SLS) | 4577833
Mar. 2002 | Khan Muhammad | 349 (HR) | 3457642
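As a rough illustration of the non-unique PK problem, the sketch below groups the customer rows from the table above by name and phone number and reports any person who carries more than one customer number. The function name `multiple_ids` and the choice of (name, phone) as the grouping attributes are assumptions made for this example, not part of the lecture.

```python
from collections import defaultdict

# Rows copied from the customer table above: (name, phone, customer number).
customers = [
    ("M. Ismail Siddiqi", "021.666.1244", "780701"),
    ("M. Ismail Siddiqi", "021.666.1244", "780203"),
    ("M. Ismail Siddiqi", "021.666.1244", "780009"),
]

def multiple_ids(rows):
    """Group rows by (name, phone) and keep groups with more than one ID."""
    groups = defaultdict(set)
    for name, phone, cust_no in rows:
        groups[(name, phone)].add(cust_no)
    # A person with several customer numbers is a likely duplicate.
    return {who: sorted(ids) for who, ids in groups.items() if len(ids) > 1}

suspects = multiple_ids(customers)
for who, ids in suspects.items():
    print(who, "has customer numbers", ids)
```

Grouping on exact name and phone only works when those fields are already clean; with dirty data, approximate matching is needed instead.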
5. Data Duplication: Householding
Group together all records that belong to the same household.
Why bother?
……… S. Ahad 440, Munir Road, Lahore
……… ………….… ………………………………
……… Shiekh Ahad No. 440, Munir Rd, Lhr
……… Shiekh Ahed House # 440, Munir Road, Lahore
……… ………….… ………………………………
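Householding can be sketched as standardizing the address and then grouping on it. The abbreviation table and the list of filler words below are assumptions chosen to cover only the three sample rows above; a real system would draw them from a reference source.

```python
import re
from collections import defaultdict

ABBREVIATIONS = {"rd": "road", "lhr": "lahore"}  # assumed lookup table
NOISE = {"no", "house"}                           # assumed filler words

def standardize_address(address):
    """Lower-case, strip punctuation, drop filler words, expand abbreviations."""
    tokens = re.sub(r"[^a-z0-9]+", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens if t not in NOISE)

# Name and address columns of the sample records above.
records = [
    ("S. Ahad", "440, Munir Road, Lahore"),
    ("Shiekh Ahad", "No. 440, Munir Rd, Lhr"),
    ("Shiekh Ahed", "House # 440, Munir Road, Lahore"),
]

def households(rows):
    """Map each standardized address to the names recorded at it."""
    groups = defaultdict(list)
    for name, address in rows:
        groups[standardize_address(address)].append(name)
    return dict(groups)

print(households(records))
```

All three rows collapse to one standardized address, so they land in the same household even though the raw address strings differ.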
6. Data Duplication: Individualization
Identify multiple records in each household which represent the same individual.
Address field is standardized.
By coincidence?
……… M. Ahad 440, Munir Road, Lahore
……… ………….… ………………………………
……… Maj Ahad 440, Munir Road, Lahore
7. Formal definition & Nomenclature
Problem statement:
“Given two databases, identify the potentially matched records efficiently and effectively.”
Many names, such as:
• Record linkage
• Merge/purge
• Entity reconciliation
• List washing and data cleansing
Current market and tools are heavily centered on customer lists.
8. Need & Tool Support
The logical solution to dirty data is to clean it in some way.
Doing it manually is very slow and prone to errors.
Tools are required to do it cost-effectively and achieve reasonable quality.
Tools exist, some for specific fields, others for a specific cleaning phase.
Because they are application-specific, they work very well, but they need support from other tools for a broad spectrum of cleaning problems.
9. Overview of the Basic Concept
In its simplest form, there is an identifying attribute (or combination of attributes) per record for identification.
Records can be from a single source or from multiple sources sharing the same PK or other common unique attributes.
Sorting is performed on the identifying attributes and neighboring records are checked.
What if there are no common attributes, or the data is dirty?
The degree of similarity is then measured numerically; different attributes may contribute differently.
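A numeric similarity in which attributes contribute differently could look like the sketch below. The weights are hypothetical, and the per-field measure (`difflib.SequenceMatcher.ratio` from the Python standard library) is just one convenient choice, not the measure prescribed by the lecture.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each attribute contributes to the score.
WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}

def field_similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_similarity(r1, r2, weights=WEIGHTS):
    """Weighted average of the per-attribute similarities."""
    total = sum(weights.values())
    score = sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items())
    return score / total

a = {"name": "Shiekh Ahad", "address": "440 Munir Road", "phone": "021.666.1244"}
b = {"name": "Shiekh Ahed", "address": "440 Munir Road", "phone": "021.666.1244"}
print(round(record_similarity(a, b), 3))
```

A threshold on this score then decides whether two records are treated as candidates for the same entity.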
10. Basic Sorted Neighborhood (BSN) Method
Concatenate data into one sequential list of N records.
Step 1: Create Keys
Compute a key for each record in the list by extracting relevant fields or portions of fields.
The effectiveness of this method depends heavily on a properly chosen key.
Step 2: Sort Data
Sort the records in the data list using the key of step 1.
Step 3: Merge
Move a fixed-size window through the sequential list of records, limiting the comparisons for matching records to those records in the window.
If the size of the window is w records, then every new record entering the window is compared with the previous w-1 records.
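The three steps can be sketched as follows. The function `sorted_neighborhood`, the toy key (first three letters of the name) and the toy match rule (identical address) are assumptions for illustration; the address of the second record is made up.

```python
def sorted_neighborhood(records, make_key, is_match, w):
    """Return index pairs (i, j), i < j, of records judged to match."""
    # Steps 1 and 2: compute a key per record and sort record indices by it.
    order = sorted(range(len(records)), key=lambda i: make_key(records[i]))
    # Step 3: each record entering the window is compared with the
    # previous w-1 records in sorted order.
    pairs = []
    for pos, i in enumerate(order):
        for j in order[max(0, pos - (w - 1)):pos]:
            if is_match(records[i], records[j]):
                pairs.append((min(i, j), max(i, j)))
    return pairs

records = [
    {"name": "Noman Abdullah", "address": "440 Munir Road"},
    {"name": "Khan Muhammad", "address": "12 Mall Road"},  # hypothetical address
    {"name": "Noma Abdullah", "address": "440 Munir Road"},
]
make_key = lambda r: r["name"][:3].upper()
is_match = lambda a, b: a["address"] == b["address"]

print(sorted_neighborhood(records, make_key, is_match, w=2))
```

Sorting brings the two "NOM" keys next to each other, so a window of only two records is enough to compare them.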
11. BSN Method: Sliding Window
[Figure: a window of w records sliding down the sorted list, showing the current window of records and the next window of records.]
12. BSN Method: Selection of Keys
Effectiveness is highly dependent on the key selected to sort the records, e.g. middle name vs. family name.
A key is a sequence of a subset of attributes, or of sub-strings within the attributes, chosen from the record.
The keys are used for sorting the entire dataset with the intention that matched candidates will appear close to each other.
First | Middle | Address | NID | Key
Muhammed | Ahmad | 440 Munir Road | 34535322 | AHM440MUN345
Muhammad | Ahmad | 440 Munir Road | 34535322 | AHM440MUN345
Muhammed | Ahmed | 440 Munir Road | 34535322 | AHM440MUN345
Muhammad | Ahmar | 440 Munawar Road | 34535334 | AHM440MUN345
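The lecture does not state how the keys in the table are derived. One prefix-based scheme that reproduces them, offered here only as an assumption, takes the first three letters of the middle name, the house number, the first three letters of the street name, and the first three digits of the NID.

```python
def make_key(middle, address, nid):
    """Build a sort key from prefixes of the middle name, address and NID."""
    number, street = address.split()[:2]  # e.g. "440", "Munir"
    return middle[:3].upper() + number + street[:3].upper() + nid[:3]

# (middle name, address, NID) columns of the table above.
rows = [
    ("Ahmad", "440 Munir Road", "34535322"),
    ("Ahmad", "440 Munir Road", "34535322"),
    ("Ahmed", "440 Munir Road", "34535322"),
    ("Ahmar", "440 Munawar Road", "34535334"),
]
keys = [make_key(*row) for row in rows]
print(keys)
```

Because only prefixes are used, small spelling differences such as Ahmad/Ahmed do not change the key, but a genuinely different record (Ahmar on Munawar Road) can receive the same key as well, which is why records that sort together still have to be compared.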
13. BSN Method: Problem with keys
Since the data is dirty, the keys WILL also be dirty, and matching records will not come together.
Data becomes dirty due to data entry errors or the use of abbreviations. Some real examples are as follows:
• Technology
• Tech.
• Techno.
• Tchnlgy
The solution is to use external standard source files to validate the data and resolve any data conflicts.
14. BSN Method: Problem with keys (e.g.)
No | Name | Address | Gender
1 | Syed N Jaffri | 420 15 4 Chaklala No Rawalpindi Street | M
2 | Syed Noman | 420 4 Rwp Scheme | M
3 | Saiam Noor | 5 Afshan Colony Flat Lahore Road Saidpur | F
No | Name | Address | Gender
1 | N. Jaffri, Syed | No. 420, Street 15, Chaklala 4, Rawalpindi | M
2 | S. Noman | 420, Scheme 4, Rwp | M
3 | Saiam Noor | Flat 5, Afshan Colony, Saidpur Road, Lahore | F
If the contents of fields are not properly ordered, similar records will NOT fall in the same window.
Example: Records 1 and 2 are similar but will occur far apart.
The solution is to TOKENize the fields, i.e. break them down further, and use the tokens in different fields for sorting to fix the error.
Example: Using either the name or the address field, records 1 and 2 will fall close.
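A small sketch of tokenizing the address field follows. The ordering rule used here (numbers first in order of appearance, then words alphabetically) is an assumption; it was chosen because it reproduces the tokenized address column shown in the first table above.

```python
import re

def tokenize_address(address):
    """Split an address into tokens: numbers first, then words alphabetically."""
    tokens = re.findall(r"[A-Za-z0-9]+", address)
    numbers = [t for t in tokens if t.isdigit()]
    words = sorted((t for t in tokens if not t.isdigit()), key=str.lower)
    return " ".join(numbers + words)

# Record number -> address as originally entered (second table above).
raw = {
    1: "No. 420, Street 15, Chaklala 4, Rawalpindi",
    2: "420, Scheme 4, Rwp",
    3: "Flat 5, Afshan Colony, Saidpur Road, Lahore",
}
tokenized = {no: tokenize_address(addr) for no, addr in raw.items()}

order_raw = sorted(raw, key=raw.get)                    # sort on the raw field
order_tokenized = sorted(tokenized, key=tokenized.get)  # sort on the tokens
print(order_raw, order_tokenized)
```

Sorting on the raw address places record 3 between records 1 and 2, while sorting on the tokenized address makes records 1 and 2 neighbors.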
15. BSN Method: Matching Candidates
Merging of records is a complex inferential process.
Example-1: Two persons with names spelled nearly but not identically have the exact same address. We infer they are the same person, e.g. Noma Abdullah and Noman Abdullah.
Example-2: Two persons have the same National ID number but their names and addresses are completely different. We infer either that this is the same person who changed his name and moved, or that the records represent different persons and the NID is incorrect for one of them.
Use of further information such as age, gender etc. can alter the decision.
Example-3: Given Noma-F and Noman-M, we could perhaps infer that Noma and Noman are siblings, i.e. sister and brother. Given Noma-30 and Noman-5, perhaps mother and son.
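The three examples can be turned into a toy rule set, sketched below. The function names, the 0.8 name-similarity threshold, the three outcome labels and the sample field values are all assumptions for illustration; a production rule base would be far richer.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """True if two strings are nearly identical (assumed threshold)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def classify(r1, r2):
    """Return 'same person', 'needs review' or 'different persons'."""
    # Example-3: further information such as gender or age can alter the decision.
    for field in ("gender", "age"):
        if field in r1 and field in r2 and r1[field] != r2[field]:
            return "different persons"
    # Example-1: nearly identical names at exactly the same address.
    if similar(r1["name"], r2["name"]) and r1["address"] == r2["address"]:
        return "same person"
    # Example-2: same NID but different name/address is ambiguous.
    if r1.get("nid") is not None and r1.get("nid") == r2.get("nid"):
        return "needs review"
    return "different persons"

noma = {"name": "Noma Abdullah", "address": "440 Munir Road", "nid": "34535322"}
noman = {"name": "Noman Abdullah", "address": "440 Munir Road", "nid": "34535322"}
moved = {"name": "Khan Muhammad", "address": "12 Mall Road", "nid": "34535322"}

print(classify(noma, noman))   # Example-1
print(classify(noma, moved))   # Example-2
print(classify({**noma, "gender": "F"}, {**noman, "gender": "M"}))  # Example-3
```

The order of the rules matters: the extra attributes are checked first so that they can override an otherwise convincing name and address match.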
16. Complexity Analysis of BSN Method
Time Complexity: O(n log n)
• O(n) for key creation
• O(n log n) for sorting
• O(wn) for matching, where 2 ≤ w ≤ n
The O(n log n) total holds when w is small relative to log n; otherwise the O(wn) matching term dominates.
Constants vary a lot.
At least three passes are required on the dataset.
Complexity of the rules and window size are detrimental to performance.
For large sets, disk I/O is detrimental.
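To make the O(wn) matching term concrete, the sketch below counts the record comparisons made by a window of size w and compares that with checking every pair of records. The function names and the sample sizes are chosen for illustration only.

```python
def window_comparisons(n, w):
    """Comparisons made when each record meets the previous w-1 records."""
    # The record at sorted position pos has only pos predecessors available.
    return sum(min(pos, w - 1) for pos in range(n))

def all_pairs(n):
    """Comparisons needed to check every pair of n records."""
    return n * (n - 1) // 2

n, w = 1000, 10
print(window_comparisons(n, w), all_pairs(n))
```

For 1,000 records and a window of 10, the window makes 8,955 comparisons against 499,500 for all pairs; when w equals n, the two counts coincide.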
17. BSN Method: Equational Theory
To specify the inferences we need an equational theory.
The logic is NOT based on string equivalence.
The logic is based on domain equivalence.
It requires a declarative rule language.