Data Quality Services in SQL Server 2012
(An Introduction)
Stéphane Fréchette
Friday April 26, 2013
Matching
Cleansing
DQS
Who am I?
My name is Stéphane Fréchette
I’m a Database & Business Intelligence Professional and CEO | Founder of
I have a passion for architecting, designing and building solutions that matter.
Self-proclaimed Open Data Hacker/Advocate. I founded Gatineau Ouverte, a citizen-led
initiative that aims to promote open access to the civic data of the city of Gatineau.
Twitter: @sfrechette
Email: stephanefrechette@ukubu.com
Blog: stephanefrechette.com
Session Outline
• Microsoft Business Intelligence (The Stack)
• Dirty Data…
• SQL Server Data Quality Services (DQS)
• Data Steward
• Knowledge Base and Domains
• Data Quality Projects
• Data Cleansing Transform – SSIS
• DQS (Install & Architecture)
• Enterprise Information Management (EIM)
• Resources
Microsoft Business Intelligence
[Diagram: the Microsoft BI stack]
• Analysis Services, Reporting Services, Integration Services, Master Data Services, Data Quality Services
• SharePoint Collaboration, Excel Workbooks, PowerPivot Applications, SharePoint Dashboards & Scorecards
• OData Feeds, Line of Business Applications, Hadoop Big Data
Dirty Data…
Do you have dirty data?
(all projects have it! It's inevitable)
Dirty Data…
Causes?
Bad data entry
Poor Data Governance
Duplicate entities in different LOB systems
Sample Data Representation
• Prospect in CRM System:
Mark Smith | 613.111-1234 | Ottawa | ON | K1P 1K1
• Prospect buys goods now entered in POS System:
Markus Smith | 1234 Stilton Ave | Kanata | ON | K1P 1K1
• Record also entered into Accounting System:
Markus Smith | 1234 Stilton Avenue | Kanata | ON | K1P 1K1
ETL process imports these records into the Data Warehouse / Data Mart
FirstName | LastName | Phone | Address | City | Province | PostalCode
Mark | Smith | 613.111-1234 | (blank) | Ottawa | ON | K1P 1K1
Markus | Smith | (blank) | 1234 Stilton Ave | Kanata | ON | K1P 1K1
Markus | Smith | (blank) | 1234 Stilton Avenue | Kanata | ON | K1P 1K1
Sample Data Representation
• Duplicate records and inaccurate, incomplete data
• What we want is a golden record (one version of the truth)
Imported records:
FirstName | LastName | Phone | Address | City | Province | PostalCode
Mark | Smith | 613.111-1234 | (blank) | Ottawa | ON | K1P 1K1
Markus | Smith | (blank) | 1234 Stilton Ave | Kanata | ON | K1P 1K1
Markus | Smith | (blank) | 1234 Stilton Avenue | Kanata | ON | K1P 1K1
Golden record:
FirstName | LastName | Phone | Address | City | Province | PostalCode
Markus | Smith | 613-111-1234 | 1234 Stilton Ave | Kanata | ON | K1P 1K1
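The consolidation shown on this slide can be sketched in a few lines of Python. This is only an illustration of the idea of a golden record, not how DQS or MDS do it: the survivorship rule (keep the most frequent non-empty value per column, with ties going to the earliest row) and the helper names `normalize_phone` and `golden_record` are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List

COLUMNS = ["FirstName", "LastName", "Phone", "Address", "City", "Province", "PostalCode"]

# The three imported rows from the slide; "" marks a missing value.
ROWS: List[Dict[str, str]] = [
    dict(zip(COLUMNS, ["Mark", "Smith", "613.111-1234", "", "Ottawa", "ON", "K1P 1K1"])),
    dict(zip(COLUMNS, ["Markus", "Smith", "", "1234 Stilton Ave", "Kanata", "ON", "K1P 1K1"])),
    dict(zip(COLUMNS, ["Markus", "Smith", "", "1234 Stilton Avenue", "Kanata", "ON", "K1P 1K1"])),
]


def normalize_phone(phone: str) -> str:
    """Format a 10-digit number as NNN-NNN-NNNN; leave anything else untouched."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) != 10:
        return phone
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def golden_record(rows: List[Dict[str, str]]) -> Dict[str, str]:
    """Hypothetical survivorship rule: most frequent non-empty value per column."""
    golden: Dict[str, str] = {}
    for col in COLUMNS:
        values = [row[col] for row in rows if row[col]]
        counts = Counter(values)
        # max() returns the first maximal item, so ties go to the earliest row.
        golden[col] = max(values, key=lambda v: counts[v]) if values else ""
    golden["Phone"] = normalize_phone(golden["Phone"])
    return golden


if __name__ == "__main__":
    merged = golden_record(ROWS)
    print(" | ".join(merged[col] for col in COLUMNS))
```

Run against the three sample rows, this prints `Markus | Smith | 613-111-1234 | 1234 Stilton Ave | Kanata | ON | K1P 1K1`, which matches the golden record on the slide. A real project would choose survivorship rules per column with the data steward.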
SQL Server Data Quality Services (DQS)
• New in SQL Server 2012
• Enables cleansing, matching, standardizing and enriching data
• Delivers trusted information for business intelligence, data warehouse, and transaction
processing workloads
• Knowledge-Driven Solution (create/edit)
• A knowledge management process that builds the knowledge base
• A data quality project that proposes changes to source data based on the knowledge in the knowledge
base (cleansing and matching)
• A key component of an Enterprise Information Management (EIM) solution
Answering the Need with DQS
• DQS helps resolve issues involving incompleteness, lack of conformity, inconsistency,
inaccuracy, invalidity, and data duplication
• Provides the following features to resolve data quality issues:
• Data Cleansing
• Matching
• Reference Data Services
• Profiling
• Monitoring
• Knowledge Base
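Of the features listed above, profiling is the simplest to illustrate outside the product. The sketch below computes a per-column completeness ratio over the sample customer rows used earlier in the deck. It is a stand-in for the kind of statistic a profiler reports, not the DQS implementation, and the function name `completeness` is an assumption made for this example.

```python
from typing import Dict, List

COLUMNS = ["FirstName", "LastName", "Phone", "Address", "City", "Province", "PostalCode"]

# Sample rows from the deck; "" marks a missing value.
ROWS: List[Dict[str, str]] = [
    dict(zip(COLUMNS, ["Mark", "Smith", "613.111-1234", "", "Ottawa", "ON", "K1P 1K1"])),
    dict(zip(COLUMNS, ["Markus", "Smith", "", "1234 Stilton Ave", "Kanata", "ON", "K1P 1K1"])),
    dict(zip(COLUMNS, ["Markus", "Smith", "", "1234 Stilton Avenue", "Kanata", "ON", "K1P 1K1"])),
]


def completeness(rows: List[Dict[str, str]], columns: List[str]) -> Dict[str, float]:
    """Return the share of rows with a non-blank value, per column (0.0 to 1.0)."""
    if not rows:
        return {col: 0.0 for col in columns}
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col, "").strip()) / total
        for col in columns
    }


if __name__ == "__main__":
    for col, ratio in completeness(ROWS, COLUMNS).items():
        print(f"{col:<12} {ratio:.0%}")
```

On the sample rows, Phone is filled in one row out of three and Address in two out of three, which is exactly the incompleteness the golden record is meant to fix.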
Data Steward
• Key role: usually a business user rather than someone from the Information Technology side
• Nutshell: Responsible for maintaining data elements in a metadata registry…
• Data Steward -> DQS Client
• Create and edit Knowledge Bases
• Run and process data through the Knowledge Bases, continually and iteratively improving them
• Knowledge Bases can be consumed and used by other Data Stewards and IT (SSIS / ETL Developers)
[Diagram: roles (DQS Data Steward, MDS Data Steward, SSIS Developer) and activities (Matching, Cleansing)]
Knowledge Bases and Domains
The knowledge base is a repository of knowledge about your data that enables you to understand
your data and maintain its integrity.
• Processes:
• Computer-assisted
• Interactive
• Components:
• Knowledge Discovery
• Domain Management
• Reference Data Services
• Matching Policy
Demo
Knowledge Base Management
(Creating a Knowledge Base)
Data Quality Projects
Improve quality of source data by performing data cleansing and data matching activities
using defined knowledge bases
• Cleansing Activity (2 step process)
• Computer-assisted: data is categorized (suggested, new, invalid, corrected, and correct)
• Interactive: the data steward approves, rejects, or modifies the results proposed by the computer-assisted
cleansing process
• Matching Activity
• Using existing knowledge base matching policy
• Prevent and remove data duplication
• Data Profiling and Notifications
• Profiling provides data quality statistics and information, such as completeness and accuracy
• Notifications point to actions that can be taken to enhance operations
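The computer-assisted cleansing step can be pictured as a function that compares each source value with a domain and returns one of the five categories named above. The sketch below is a toy model under stated assumptions: the domain contents, the two thresholds (0.8 for automatic correction, 0.7 for suggestions), and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all choices made for this example, not the DQS algorithm.

```python
from difflib import SequenceMatcher
from typing import Set, Tuple

# Hypothetical "City" domain: known valid values and values flagged as invalid.
VALID_CITIES: Set[str] = {"Ottawa", "Kanata", "Gatineau"}
INVALID_CITIES: Set[str] = {"N/A", "Unknown"}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (stand-in for a real matcher)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cleanse(value: str,
            valid: Set[str] = VALID_CITIES,
            invalid: Set[str] = INVALID_CITIES,
            auto_correct: float = 0.8,
            suggest: float = 0.7) -> Tuple[str, str]:
    """Return (output value, status) for one source value."""
    if value in valid:
        return value, "correct"
    if value in invalid:
        return value, "invalid"
    # Closest valid value; sorted() keeps the result deterministic.
    best = max(sorted(valid), key=lambda candidate: similarity(value, candidate))
    score = similarity(value, best)
    if score >= auto_correct:
        return best, "corrected"   # confident enough to replace automatically
    if score >= suggest:
        return best, "suggested"   # left for the data steward to approve
    return value, "new"            # unknown value, nothing close in the domain


if __name__ == "__main__":
    for city in ["Ottawa", "Otawa", "Gatino", "Toronto", "N/A"]:
        print(city, "->", cleanse(city))
```

The interactive step then amounts to a person reviewing the `suggested` and `new` rows and approving, rejecting, or editing each proposal, which in turn feeds the knowledge base.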
Demo
Data Quality Project
(Cleansing and Matching)
DQS Cleansing Transform in SSIS
• When you want to automate the cleansing and matching process
and not use the DQS Client
• Use SSIS for batch data cleansing
• Matching can be done with Master Data Services (MDS)
• SSIS can be leveraged to bring DQS and MDS together
*DQS does not expose matching functionality for SSIS, but you can use the Fuzzy Grouping transform to
identify duplicate data
*The Cleansing transform is single threaded – use multiple transforms for parallelism
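To give a feel for what identifying duplicate data with fuzzy grouping means, here is a small Python sketch. It is not the SSIS Fuzzy Grouping transform and does not reproduce its algorithm: it uses `difflib.SequenceMatcher` from the standard library, a greedy grouping strategy, and a 0.85 threshold, all of which are assumptions made for this example. The fourth record (Jane Doe) is invented to show a non-duplicate.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (FirstName, LastName, PostalCode); the last record is hypothetical.
RECORDS: List[Tuple[str, str, str]] = [
    ("Mark", "Smith", "K1P 1K1"),
    ("Markus", "Smith", "K1P 1K1"),
    ("Markus", "Smith", "K1P 1K1"),
    ("Jane", "Doe", "K2K 2K2"),
]


def match_key(record: Tuple[str, str, str]) -> str:
    """Build a lowercase comparison string from the matching columns."""
    return " ".join(record).lower()


def fuzzy_groups(records: List[Tuple[str, str, str]],
                 threshold: float = 0.85) -> List[List[int]]:
    """Greedy grouping: each record joins the first group whose
    representative (its first member) is similar enough, else starts a new one."""
    groups: List[List[int]] = []
    for index, record in enumerate(records):
        key = match_key(record)
        for group in groups:
            representative = match_key(records[group[0]])
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups


if __name__ == "__main__":
    for group in fuzzy_groups(RECORDS):
        print([RECORDS[i] for i in group])
```

With the sample data, the three Smith records land in one group and the invented record stays on its own. Comparing only against a group representative keeps the sketch short; it is order-dependent, so a production matcher would need something more robust.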
Demo
Data Cleansing Transform
(Automating the Cleansing and Matching using SSIS)
Installing DQS
• Requires the Business Intelligence, Enterprise, or Developer edition of SQL Server 2012
• During SQL Server setup:
• Instance Features -> Data Quality Services
• Shared Features -> Data Quality Client
• Execute the Data Quality Server Installer:
• C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\DQSInstaller.exe
• Or launch Data Quality Services – Data Quality Server Installer
(under Apps - Microsoft SQL Server 2012)
DQS Architecture
DQS Server
DQS Catalog (3 databases)
• DQS_MAIN (Knowledge Bases)
• DQS_PROJECTS (Projects)
• DQS_STAGING_DATA (Sandbox, scratch pad area)
Security – Database Roles
• dqs_administrator
• dqs_kb_editor
• dqs_kb_operator
Windows Azure Marketplace
Reference Data Services -> validating, cleansing and enriching your data
Performance considerations - FYI
• Major performance improvements from RTM to CU1 release of SQL Server 2012 (strongly
recommend patching and upgrading) http://bit.ly/11eEhHC
• Must read -> DQS Performance Best Practices Guide http://bit.ly/16Gwenl
• Understand data volumes and hardware requirements… plan wisely!
Enterprise Information Management (EIM)
The EIM stack as a whole is the ‘Master Data Management’ solution from Microsoft and
consists of the following:
• SQL Server Data Quality Services (DQS) - Capture and record knowledge, rules, and actions
• SQL Server Master Data Services (MDS) - Master Data Management repository, Dimension data
• SQL Server Integration Services (SSIS) – Moves data, integration
Enterprise Information Management (EIM)
‘Master Data Management’
Resources
• Data Quality Services Team Blog (MSDN) http://bit.ly/WCI2nO
• SQL Server Data Quality Services (TechNet) http://bit.ly/ZaUO8k
• DQS Performance Best Practices Guide http://bit.ly/16Gwenl
• Enterprise Information Management (EIM) Bringing Together SSIS, DQS, and
MDS (Video – Channel 9) http://bit.ly/NJXvKr
• Matt Masson – Getting Started with DQS and MDS http://bit.ly/149Ga9n
• Paras Doshi’s Blog (DQS) http://bit.ly/YoLthh
What Questions Do You Have?
Thank You
For attending this session
