DQS/MDS INTROS & DQS MATCHING
Microsoft
SQL Server 2012
SQL Server 2014
Neil Hambly
SQL Server Evangelist / Practice Lead
PASS London Chapter Leader
Melissa Data MVP
PASS Virtual Chapter “Professional Development” Leader
Contributing Author
Agenda
Matching Project
What is record matching?
Data Issues
DQS Matching Process
DQS Data Matching Principles
Matching Policy
DQS Intro
MDS Intro
Data Cleansing:
Modifying, removing, or correcting data that is incorrect or
incomplete, either computer-assisted or interactively.
Matching:
Identifying duplicates in a rules-based process so that
de-duplication can be performed. Data quality can also be
verified against a reference data provider, such as reference
data services from Azure Marketplace providers.
Profiling:
Analyzing data to gain insight into its quality during the domain
management, matching, and data cleansing processes.
Profiling is a powerful tool in a DQS data quality solution.
Monitoring:
Determining the state of data quality activities and validating
that the data quality solution is doing what it was designed to do.
Knowledge Base:
DQS is a knowledge-driven solution that analyzes data using
knowledge built with DQS. Create data quality processes
to enhance the knowledge about the data, continuously
improving data quality.
• Create a Matching Policy
• Data Quality Matching
• Match Similar Data
Master Data Services Configuration Manager
Tool to create and configure Master Data Services
databases and web applications.
Master Data Manager
Web application used to perform administrative tasks
(such as creating a model or business rule) and used by
users to modify master data.
MDSModelDeploy.exe
Tool to create packages of your model objects and
data, for deploying to other environments.
Master Data Services web service
Developers can use it to build custom solutions for
Master Data Services.
Master Data Services Add-in for Excel
Manage data and create new entities and attributes.
Import Example
Record matching is the task of identifying
records that refer to the same real-world
entity.
The Cost of Duplicate Data
…a few examples…
Direct marketing communications are doubled up unnecessarily.
Product shipments and customer-site based services could be sent
to the wrong address due to an incorrect duplicate record being
used.
Your sales reporting may be inaccurate due to an over-inflated
number of customers.
Inaccurate sales analysis due to sales being split between multiple
records that represent the same customer, resulting in an
undervaluing of some key customers.
Where do Duplicate Records come from?
Poorly designed software: No verification of existing records upon entry.
Formatting & abbreviations: "Doctor Robert Smith" vs. "Dr. Bob Smith".
Data validation: Human errors can creep into the system when field input is not validated.
Company mergers and acquisitions: Merging systems may result in duplicates in the merged data.
Change of attributes: The same person may appear not to exist in the database if some attributes have changed (e.g., address, name).
…Data Issues…
There are different ways to represent the same person or address in a database:
Data is ‘fuzzy’ in nature (spelling mistakes, abbreviations etc.).
How Do Data Issues Affect Matching?
Matching Results
Matching Results Reasoning
The Data
[Diagram labels: Integrated Profiling; Progress; Notifications; Status; Build; Use; DQ Projects; Knowledge Management; Knowledge Base; Sample Data]
Identifies exact and approximate matches, enabling
removal of duplicate data.
Enables creating a matching policy interactively using a
computer-assisted process.
Ensures that values that are equivalent, but were entered
in a different format or style, are in fact rendered
uniform.
A matching policy is prepared in the knowledge base.
A matching policy consists of matching rules that
assess how well one record matches another.
Specify in the rule whether records’ values have to be
an exact match, similar, or prerequisite.
Train your policy by running and tuning each rule
separately.
Identify the attributes in your data that are most
significant for matching.
Create domains/composite domains based on your data
structure.
Define matching rules.
Single domains: Birth Date, Gender, Email, Phone
Composite Domain Full Name: F. Name, M. Name, L. Name
Composite Domain Full Address: Street, City, State, Country
Similarity: select Similar if field values can be similar; select Exact if field values
must be identical.
Weight: determines the contribution of each domain in the rule to the overall
matching score for two records.
Prerequisite: the field values must return a 100% match; otherwise the records are
not considered a match.
Minimum matching score: the threshold at or above which two records are
considered a match.
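The way these four parameters combine can be sketched in a few lines of Python. This is only an illustration, not the DQS algorithm: the rule layout, the field names, the 80% threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example, and prerequisite domains are assumed to carry no weight.

```python
from difflib import SequenceMatcher

# Hypothetical rule: (domain, kind, weight). Weights of scored domains sum to 100.
RULE = [
    ("gender", "prerequisite", 0),
    ("full_name", "similar", 60),
    ("city", "exact", 40),
]
MIN_MATCHING_SCORE = 80  # illustrative threshold, in percent


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; DQS uses its own internal measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_score(rec_a: dict, rec_b: dict, rule=RULE) -> float:
    """Return a 0-100 score, or 0.0 if any prerequisite domain differs."""
    score = 0.0
    for domain, kind, weight in rule:
        a, b = rec_a[domain], rec_b[domain]
        if kind == "prerequisite":
            if a != b:
                return 0.0  # prerequisite failed: never a match
        elif kind == "exact":
            score += weight * (1.0 if a == b else 0.0)
        else:  # "similar"
            score += weight * similarity(a, b)
    return score


def is_match(rec_a: dict, rec_b: dict, rule=RULE, threshold=MIN_MATCHING_SCORE) -> bool:
    """Two records match when their score reaches the minimum matching score."""
    return matching_score(rec_a, rec_b, rule) >= threshold
```

With this rule, two records that differ only by a small spelling variation in the name still pass the threshold, while a different gender rules the pair out regardless of the other fields.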
Domains of type ‘Date’, ‘Integer’ or ‘Decimal’ can be matched using the
‘Similar’ property by assigning a tolerance either in percentage or integer.
Field values that fall within the defined tolerance are considered a match.
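A tolerance check of this kind might look as follows. The helper names are hypothetical, and the choice to compute the percentage relative to the larger of the two values and to measure date tolerance in days are assumptions for the sketch; DQS may define them differently.

```python
from datetime import date


def within_integer_tolerance(a: float, b: float, tolerance: float) -> bool:
    """True when the absolute difference is at most `tolerance`."""
    return abs(a - b) <= tolerance


def within_percent_tolerance(a: float, b: float, tolerance_pct: float) -> bool:
    """True when the difference, relative to the larger magnitude, is within the percentage."""
    base = max(abs(a), abs(b))
    if base == 0:
        return True  # both values are zero
    return abs(a - b) / base * 100 <= tolerance_pct


def dates_within_days(d1: date, d2: date, days: int) -> bool:
    """True when the two dates are at most `days` apart."""
    return abs((d1 - d2).days) <= days
```

For example, `dates_within_days(date(1980, 1, 1), date(1980, 1, 3), 2)` treats two birth dates entered two days apart as a match.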
Uniqueness | Usage | Description | Domains
Low | Define as Prerequisite; define with lower weights | Provides discriminatory information | Gender, City, State
High | Define as Similar or Exact; define with higher weights | Provides highly identifiable information and is highly discriminatory | Names (First, Last, Company), Address Line 1

Completeness | Usage | Description
Low | Do not use, or define with low weight | High level of missing values
High | Include for matching if the column provides highly identifiable information | Low level of missing values
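To decide where a column falls in these tables, its uniqueness and completeness can be estimated from a data sample. The sketch below is an assumption about how such ratios could be computed (share of non-missing values, and share of distinct values among the non-missing ones); it is not the formula DQS profiling uses, and the function name is hypothetical.

```python
def profile(records, columns, missing=(None, "")):
    """Return completeness and uniqueness ratios (0 to 1) for each column."""
    stats = {}
    total = len(records)
    for col in columns:
        # Keep only values that are actually present
        values = [r.get(col) for r in records if r.get(col) not in missing]
        completeness = len(values) / total if total else 0.0
        uniqueness = len(set(values)) / len(values) if values else 0.0
        stats[col] = {"completeness": completeness, "uniqueness": uniqueness}
    return stats
```

A column such as Gender would typically show high completeness but low uniqueness, which suggests using it as a prerequisite or with a low weight.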
• The Matching Results tab displays statistics for the current and
previous run of a matching rule.
• Restore the previous rule.
Home Team, Song, Artist
The DQS matching system uses the knowledge accumulated in the
knowledge base to propose matching candidates. This knowledge
includes:
Synonyms, Syntax Errors and their Leading Value (by domain)
Domain Values and their synonyms and syntax errors are used
by the matching system to find identical or similar records.
Term-Based Relations (TBR)
TBRs improve the consistency of data attribute values by
transforming data values to a single form using user-defined
term relations. In matching, TBRs are applied in memory only,
to boost matching accuracy.
Nulls and Equivalents (“Unknown”, “99999”…)
Manage values that represent missing data by linking them to the
‘DQS_Null’ value to ensure that they are treated as a
match.
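The effect of this knowledge on values before they are compared can be sketched as a normalization step. The dictionaries below are hypothetical sample knowledge, and the functions are an illustration of the idea rather than how DQS stores or applies its knowledge base.

```python
# Hypothetical knowledge for illustration only
LEADING_VALUES = {"bob": "Robert", "rob": "Robert"}    # synonym -> leading value
TERM_RELATIONS = {"st.": "Street", "ave.": "Avenue"}  # term -> single form
NULL_EQUIVALENTS = {"unknown", "99999", ""}           # values meaning "missing"
DQS_NULL = None                                       # stand-in for the 'DQS_Null' value


def normalize_value(value, leading_values=LEADING_VALUES):
    """Map null equivalents to DQS_NULL and synonyms to their leading value."""
    if value is None or value.strip().lower() in NULL_EQUIVALENTS:
        return DQS_NULL
    cleaned = value.strip()
    return leading_values.get(cleaned.lower(), cleaned)


def apply_term_relations(value, relations=TERM_RELATIONS):
    """Replace each whitespace-separated term that has a defined relation."""
    return " ".join(relations.get(tok.lower(), tok) for tok in value.split())
```

After normalization, "Bob" and "Robert" compare as identical, and "Unknown" and "99999" both collapse to the same null marker.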
String 1 | String 2 | Similarity Score (Before) | Similarity Score (After) | Character
175 CLEARBROOK ROAD P.O. BOX 535 | 175 CLEARBROOK ROAD P.O.BOX 535 | 0.92 | 1.00 | .
1834 E. 42ND STREET | 1834 E. 42ND. ST. | 0.695 | 0.857 | .
1721 DE KALB AVE, NE | 1721 DE KALB AVE NE | 0.88 | 1.00 | ,
14538 S. GARFIELD AVE., BLDG. 1-B | 14538 S GARFIELD AVE BLDG 1B | 0.676 | 0.944 | , . -
#704, SJ Technoville BD, 60-19 | 704 SJ Technoville BD 60 19 | 0.65 | 1.00 | # , -
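The before/after effect in the table can be reproduced in spirit with a short sketch. It assumes the listed characters are treated as separators (replaced by a space) and uses `difflib.SequenceMatcher` as a stand-in similarity measure, so the scores it produces will not equal the DQS values shown above.

```python
from difflib import SequenceMatcher

SPECIAL_CHARACTERS = "#,.-"  # the characters listed in the table's last column


def strip_special(text: str, chars: str = SPECIAL_CHARACTERS) -> str:
    """Replace special characters with spaces, then collapse repeated whitespace."""
    table = str.maketrans({c: " " for c in chars})
    return " ".join(text.translate(table).split())


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the DQS scoring function."""
    return SequenceMatcher(None, a, b).ratio()


s1 = "175 CLEARBROOK ROAD P.O. BOX 535"
s2 = "175 CLEARBROOK ROAD P.O.BOX 535"
before = similarity(s1, s2)
after = similarity(strip_special(s1), strip_special(s2))
```

Here `before` is below 1.0 and `after` is exactly 1.0, because both strings reduce to the same text once the period is treated as a separator.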
Example:
A Matching project is performed in three steps:
Mapping - map source columns to domains.
Matching - run matching and view the results; it includes additional
functionality such as:
• Reject records
• Filter results by ‘Matched’ & ‘Unmatched’ and by matching
score.
• Display clusters in two different methods (overlapping and
non-overlapping)
Export - export both matching results (clusters) and survivors
(unique records).
In Overlapping clusters a record may appear more than once in various clustered
results. This structure may be harder to read since the same record exists in multiple
clusters.
In Non-Overlapping clusters, the system unifies clusters containing the same
record. This structure is easier to read as you won't repeat the same observation
twice.
Overlapping Clusters: (A~B), (B~C)
Non-Overlapping Cluster: (A~B~C)
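Unifying overlapping clusters into non-overlapping ones amounts to merging every cluster that shares a record with another. A minimal sketch using a union-find structure is shown below; the function name is hypothetical and this is one common way to do the merge, not necessarily how DQS implements it.

```python
def merge_overlapping(clusters):
    """Merge clusters that share any record into non-overlapping clusters."""
    parent = {}

    def find(x):
        # Return the representative of x's group, compressing the path on the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for cluster in clusters:
        members = list(cluster)
        for m in members:
            find(m)  # register every record, including singletons
        for other in members[1:]:
            union(members[0], other)

    groups = {}
    for record in parent:
        groups.setdefault(find(record), set()).add(record)
    return sorted(sorted(g) for g in groups.values())
```

Calling `merge_overlapping([("A", "B"), ("B", "C")])` yields the single cluster `[["A", "B", "C"]]`, matching the (A~B~C) example above.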
Overlapping Clusters
Non-Overlapping Clusters
Select the Rejected check box to move a record out of its proposed cluster; the
change takes effect when you move to the next page in the activity. Unlike a
cleansing project, where records move between tabs instantly, rejected records
are not removed from the clusters shown in the user interface.
DQS Client User Interface
Exported Matching Results
Matching and Survivorship results can be exported to a SQL table,
Excel or CSV file for further analysis or consumption.
Minimum matching score parameter in a matching rule
http://speakerscore.com/S7PT
THANK YOU


Dqs mds-matching 15042015

  • 1. DQS/MDS INTRO’S & DQS MATCHING Microsoft SQL Server 2012 SQL Server 2014 Neil Hambly SQL Server Evangelist / Practice Lead PASS London Chapter Leader Melissa Data MVP PASS Virtual Chapter “Professional Development” Leader Contributing Author
  • 2. Agenda Matching Project What is record matching? Data Issues DQS Matching Process DQS Data Matching Principles Matching Policy DQS Intro MDS Intro
  • 3. Data Cleansing: Modifications, removal, correcting data or incomplete, either computer-assisted or interactively. Matching: Identification of duplicates in a rules-based process, perform de-duplication., verifying data quality using reference data provider. Use reference data services from Azure Marketplace providers Profiling: Analysis of data for insight into its data quality , domain management, matching, and data cleansing processes. Profiling is a powerful tool in a DQS data quality solution. Monitoring: Determine the state of data quality activities. Validate data quality solution is doing what it was designed to do. Knowledge Base: DQS is a knowledge-driven solution , analyzing data using knowledge built with DQS. Create data quality processes to enhance the knowledge of data , continuously improving data quality
  • 4. • Create a Matching Policy • Data Quality Matching • Match Similar Data
  • 5. Master Data Services Configuration Manager Tool to create and configure Master Data Services databases and web applications. Master Data Manager Web application for performing administrative tasks (creating a model or business rule), and that users access to modify master data. MDSModelDeploy.exe Tool to create packages of your model objects and data, for deploying to other environments. Master Data Services web service Developers can use to develop custom solutions for Master Data Services. Master Data Services Add-in for Excel Manage data and create new entities and attributes.
  • 6.
  • 7.
  • 8.
  • 10. Record matching is the task of identifying records that match the same real world entity.
  • 11. The Cost of Duplicate Data …a few examples… Direct marketing communications are doubled up unnecessarily. Product shipments and customer-site based services could be sent to the wrong address due to an incorrect duplicate record being used. Your sales reporting may be inaccurate due to an over- inflated number of customers. Inaccurate sales analysis due to sales being split between multiple records that represent the same customer, resulting in an undervaluing of some key customers.
  • 12. Where do Duplicate Records come from? Poorly designed software No verification of existing records upon entry Formatting & abbreviations "Doctor Robert Smith" Vs. "Dr. Bob Smith". Data validation Human errors can creep into the system when fields’ input is not validated Company merging and acquisitions Merging systems may result in duplicates in the merged data. Change of attributes The same person may appear to not exist in the database if some of the attributes were changed (e.g., address, name etc.)
  • 13. …Data Issues… There are different ways to represent the same person or address in a database: Data is ‘fuzzy’ in nature (spelling mistakes, abbreviations etc.).
  • 14. How Data Issues Affects Matching? Matching Results Matching Results Reasoning The Data
  • 15.
  • 17. Identifies exact and approximate matches, enabling removal of duplicate data. Enables creating a matching policy interactively using a computer-assisted process. Ensures that values that are equivalent, but were entered in a different format or style, are in fact rendered uniform.
  • 18.
  • 19. A matching policy is prepared in the knowledge base. A matching policy consists of matching rules that assess how well one record matches to another. Specify in the rule whether records’ values have to be an exact match, similar, or prerequisite. Train your policy by running and tuning each rule separately.
  • 20. Identify the attributes in your data that are most significant for matching. Create domains/composite domains based on your data structure. Define matching rules. Birth Date Gender Composite Domain Full Name F. Name M. Name L. Name Email Phone Composite Domain Full Address Street City State Country
  • 21. Similarity: select Similar if field values can be similar; select Exact if field values must be identical. Weight: determines the contribution of each domain in the rule to the overall matching score for two records. Prerequisite: requires the field values to return a 100% match; otherwise the records are not considered a match. Minimum matching score: the threshold at or above which two records are considered to be a match.
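The way these four rule parameters combine can be illustrated with a minimal sketch. This is not the DQS implementation: `DomainRule`, `field_score`, `matching_score` and `is_match` are hypothetical names, Python's `difflib` stands in for whatever similarity function DQS uses, and the weighted average is an assumed way of combining field scores.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class DomainRule:
    domain: str                 # field name in the record (hypothetical structure)
    similarity: str             # "exact" or "similar"
    weight: float = 0.0         # contribution to the overall score
    prerequisite: bool = False  # must match 100% or the pair is rejected


def field_score(a: str, b: str, similarity: str) -> float:
    """Score one field from 0 to 1. difflib is a stand-in, not the DQS algorithm."""
    if similarity == "exact":
        return 1.0 if a == b else 0.0
    return SequenceMatcher(None, a, b).ratio()


def matching_score(rec1: dict, rec2: dict, rules: list) -> float:
    """Weighted average of field scores; 0.0 if any prerequisite fails."""
    for rule in rules:
        if rule.prerequisite and rec1[rule.domain] != rec2[rule.domain]:
            return 0.0
    weighted = [r for r in rules if not r.prerequisite]
    total_weight = sum(r.weight for r in weighted)
    if total_weight == 0:
        return 1.0  # only prerequisites were defined, and all of them matched
    score = sum(
        r.weight * field_score(rec1[r.domain], rec2[r.domain], r.similarity)
        for r in weighted
    )
    return score / total_weight


def is_match(rec1: dict, rec2: dict, rules: list, min_score: float) -> bool:
    """Records match when the score is at or above the minimum matching score."""
    return matching_score(rec1, rec2, rules) >= min_score


rules = [
    DomainRule("gender", "exact", prerequisite=True),
    DomainRule("email", "exact", weight=0.6),
    DomainRule("name", "similar", weight=0.4),
]
a = {"gender": "M", "email": "bob@example.com", "name": "Robert Smith"}
b = {"gender": "M", "email": "bob@example.com", "name": "Robert Smyth"}
c = {"gender": "F", "email": "bob@example.com", "name": "Robert Smith"}
```

With these illustrative rules, `a` and `b` clear a minimum matching score of 0.8, while `a` and `c` are rejected outright because the prerequisite domain differs.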
  • 22. Domains of type ‘Date’, ‘Integer’ or ‘Decimal’ can be matched using the ‘Similar’ property by assigning a tolerance, expressed either as a percentage or as an integer. Field values that fall within the defined tolerance are considered a match.
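A tolerance check of this kind might look like the sketch below. The function names are hypothetical, and the source does not say which value a percentage tolerance is measured against, so the sketch assumes it is taken relative to the larger of the two values; dates are compared by a whole number of days.

```python
from datetime import date


def within_tolerance(a: float, b: float, tolerance: float, percentage: bool = False) -> bool:
    """True when two numeric values differ by no more than the tolerance."""
    diff = abs(a - b)
    if percentage:
        # Assumption: the percentage is relative to the larger magnitude.
        return diff <= max(abs(a), abs(b)) * tolerance / 100
    return diff <= tolerance


def dates_within_days(d1: date, d2: date, days: int) -> bool:
    """True when two dates are at most `days` apart."""
    return abs((d1 - d2).days) <= days
```

For example, `within_tolerance(100, 104, 5, percentage=True)` accepts the pair, while `within_tolerance(10, 13, 2)` rejects it.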
  • 23. Uniqueness | Usage | Description | Domains
    Low | Define as Prerequisite; define with lower weights | Provides discriminatory information | Gender, City, State
    High | Define as Similar or Exact; define with higher weights | Provides highly identifiable information and is highly discriminatory | Names (First, Last, Company), Address Line 1
    Completeness | Usage | Description
    Low | Do not use, or define with low weight | High level of missing values
    High | Include for matching if the column provides highly identifiable information | Low level of missing values
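To get a rough feel for where a column falls on these two scales, simple ratios can be computed over a sample of its values. The definitions below are assumptions made for illustration (share of non-empty values, and share of distinct values among the non-empty ones); DQS profiling may calculate its statistics differently.

```python
def _present(values):
    """Keep values that are neither None nor blank."""
    return [v for v in values if v is not None and str(v).strip() != ""]


def completeness(values) -> float:
    """Share of values that are present; a low result means many missing values."""
    if not values:
        return 0.0
    return len(_present(values)) / len(values)


def uniqueness(values) -> float:
    """Share of distinct values among the present ones."""
    present = _present(values)
    if not present:
        return 0.0
    return len(set(present)) / len(present)


gender = ["M", "F", "M", "F"]
last_name = ["Smith", "Jones", "Patel", "Nguyen"]
phone = ["555-0100", None, "", "555-0101"]
```

In this toy sample `gender` has low uniqueness (0.5), `last_name` has high uniqueness (1.0), and `phone` is only half complete, which under the guidance above would argue for a low weight.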
  • 24. The Matching Results tab displays statistics for the current and previous runs of a matching rule. Restore the previous rule if needed.
  • 27. The DQS matching system uses the knowledge accumulated in the knowledge base to propose matching candidates. This knowledge includes:
    Synonyms, Syntax Errors and their Leading Value (by domain): domain values and their synonyms and syntax errors are used by the matching system to find identical or similar records.
    Term-Based Relations (TBR): TBR improves the consistency of data attribute values by transforming data values to a single form using user-defined term relations. In matching, TBRs are only applied in-memory to boost matching accuracy.
    Nulls and Equivalents (“Unknown”, “99999”…): manage values that represent missing data by linking them to the ‘DQS_Null’ value to ensure that they are considered a match.
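The effect of this knowledge on matching can be sketched as a normalization step applied to values before they are compared. Everything here is illustrative: the mappings are hypothetical sample data, the function names are invented, and the real knowledge base is maintained in DQS rather than in dictionaries like these.

```python
DQS_NULL = "DQS_Null"  # stand-in constant for the DQS null value named above

# Hypothetical knowledge: synonym or syntax error -> leading value.
LEADING_VALUE = {"Dr.": "Doctor", "Bob": "Robert"}

# Hypothetical values that represent missing data (compared in lower case).
NULL_EQUIVALENTS = {"", "unknown", "99999"}


def normalize_value(value):
    """Map null equivalents to DQS_NULL and synonyms to their leading value."""
    if value is None or value.strip().lower() in NULL_EQUIVALENTS:
        return DQS_NULL
    cleaned = value.strip()
    return LEADING_VALUE.get(cleaned, cleaned)


def apply_term_relations(text: str, relations: dict) -> str:
    """Rewrite each whitespace-separated term to its single agreed form."""
    return " ".join(relations.get(term, term) for term in text.split())
```

After normalization, "Unknown" and "99999" compare as equal, "Bob" is compared as "Robert", and a term relation such as `{"ST": "STREET"}` brings "1834 E 42ND ST" and "1834 E 42ND STREET" to the same form.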
  • 28. Example:
    String 1 | String 2 | Similarity score (before) | Similarity score (after) | Character
    175 CLEARBROOK ROAD P.O. BOX 535 | 175 CLEARBROOK ROAD P.O.BOX 535 | 0.92 | 1.00 | .
    1834 E. 42ND STREET | 1834 E. 42ND. ST. | 0.695 | 0.857 | .
    1721 DE KALB AVE, NE | 1721 DE KALB AVE NE | 0.88 | 1.00 | ,
    14538 S. GARFIELD AVE., BLDG. 1-B | 14538 S GARFIELD AVE BLDG 1B | 0.676 | 0.944 | , . -
    #704, SJ Technoville BD, 60-19 | 704 SJ Technoville BD 60 19 | 0.65 | 1.00 | # , -
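The table appears to compare similarity scores before and after the listed characters are disregarded. The sketch below reproduces the idea, not the numbers: it assumes each special character is replaced by a space and whitespace is collapsed, and it uses `difflib.SequenceMatcher` as a stand-in similarity measure, so its scores will differ from the DQS scores shown. `strip_special` and `similarity` are hypothetical helper names.

```python
import re
from difflib import SequenceMatcher

# The characters listed in the table's "Character" column.
SPECIAL_CHARS = re.compile(r"[.,#\-]")


def strip_special(text: str) -> str:
    """Replace special characters with spaces, then collapse whitespace."""
    return " ".join(SPECIAL_CHARS.sub(" ", text).split())


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in the range 0..1; not the DQS scoring function."""
    return SequenceMatcher(None, a, b).ratio()


PAIRS = [
    ("175 CLEARBROOK ROAD P.O. BOX 535", "175 CLEARBROOK ROAD P.O.BOX 535"),
    ("1834 E. 42ND STREET", "1834 E. 42ND. ST."),
    ("1721 DE KALB AVE, NE", "1721 DE KALB AVE NE"),
    ("14538 S. GARFIELD AVE., BLDG. 1-B", "14538 S GARFIELD AVE BLDG 1B"),
    ("#704, SJ Technoville BD, 60-19", "704 SJ Technoville BD 60 19"),
]

for s1, s2 in PAIRS:
    before = similarity(s1, s2)
    after = similarity(strip_special(s1), strip_special(s2))
    print(f"{before:.3f} -> {after:.3f}  {s1!r} / {s2!r}")
```

Under these assumptions the first, third and fifth pairs become identical once the characters are removed, which is consistent with the 1.00 "after" scores in the table; the other two pairs still differ ("STREET" vs. "ST", "1 B" vs. "1B").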
  • 30. A matching project is performed in three steps: Mapping: map source columns to domains. Matching: run matching and view the results; this step also lets you reject records, filter results by ‘Matched’ and ‘Unmatched’ and by matching score, and display clusters in two different ways (overlapping and non-overlapping). Export: export both matching results (clusters) and survivors (unique records).
  • 31. In Overlapping clusters, a record may appear more than once across the clustered results. This structure may be harder to read, since the same record exists in multiple clusters. In Non-Overlapping clusters, the system unifies clusters that contain the same record. This structure is easier to read because no record is repeated. Example: overlapping clusters (A~B), (B~C); non-overlapping cluster (A~B~C).
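Unifying clusters that share a record can be sketched with a union-find (disjoint-set) structure. This is one common way to do it, chosen for illustration; the source does not say how DQS builds its non-overlapping view, and `merge_overlapping` is a hypothetical name.

```python
def merge_overlapping(clusters):
    """Merge clusters that share any record into non-overlapping clusters."""
    parent = {}

    def find(x):
        # Return the representative of x's group, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for cluster in clusters:
        members = list(cluster)
        for member in members:
            find(member)  # register every record, including singletons
        for other in members[1:]:
            union(members[0], other)

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), set()).add(record)
    return sorted(sorted(group) for group in groups.values())


print(merge_overlapping([("A", "B"), ("B", "C"), ("D", "E")]))
# [['A', 'B', 'C'], ['D', 'E']]
```

The overlapping clusters (A~B) and (B~C) share record B, so they collapse into the single cluster (A~B~C), while (D~E) is left as is.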
  • 33. Check the Rejected box to move a record out of its proposed cluster; the record is moved when you go to the next page in the activity. Unlike the Cleansing Data Project, where records move between tabs instantly, rejected records are not removed from the clusters in the user interface. [Figure: DQS Client user interface; exported matching results]
  • 34. Matching and Survivorship results can be exported to a SQL table, Excel or CSV file for further analysis or consumption.
  • 35. [Figure: the Minimum matching score parameter in a matching rule]