Data Profiling, Data Catalogs and Metadata Harmonisation


These notes discuss the related topics of Data Profiling, Data Catalogs and Metadata Harmonisation. They describe a detailed structure for data profiling activities and identify various open source and commercial tools and data profiling algorithms. Data profiling is a necessary prerequisite to constructing a data catalog. A data catalog makes an organisation's data more discoverable. The data collected during data profiling forms the metadata contained in the data catalog. This assists with ensuring data quality. Data profiling is also a necessary activity for Master Data Management initiatives. These notes describe a metadata structure and provide details on metadata standards and sources.


1. Data Profiling, Data Catalogs and Metadata Harmonisation
   Alan McSweeney
   http://ie.linkedin.com/in/alanmcsweeney
   https://www.amazon.com/dp/1797567616
2. Data Profiling, Data Catalogs and Metadata Harmonisation
   • Data Profiling – Understand Your Data
   • Data Catalog – Database of Data Assets
   • Metadata Harmonisation – Standardisation of Data Descriptions
3. Data Profiling
   • Preparing data into a usable and analysable format can consume up to 80% of the resources of a data project
4. Data Profiling
   • Process for discovering and examining the data available in existing data sources
   • Essential initial activity
     − Understand the structure and contents of data
     − Evaluate data quality and data conformance with standards
     − Identify terms and metadata used to describe data
     − Identify data relationships and dependencies
     − Enable the creation of a master data view across all data sources
     − Understand and define data integration requirements
   • Understand data issues, problems and challenges at the start of any data project:
     − Data cleansing
     − Data analytics
     − Master data management
     − Data catalog
     − Data migration
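As a rough illustration of that first profiling pass, here is a minimal sketch using pandas; the file name and columns are illustrative assumptions, and the tools listed later perform these checks far more thoroughly.

```python
# A minimal first-pass profile of a table with pandas.
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical source extract

for col in df.columns:
    series = df[col]
    print(f"--- {col} ---")
    print(f"  dtype:    {series.dtype}")
    print(f"  nulls:    {series.isna().sum()} of {len(series)}")
    print(f"  distinct: {series.nunique()}")
    if pd.api.types.is_numeric_dtype(series):
        print(f"  min/max:  {series.min()} / {series.max()}")
```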
5. Data Profiling – Wider Context (diagram)
   • Flow: source system → data profiling → common data model, data storage and access platform → visualisation and reporting, analysis and data access, via a data virtualisation layer
   • Profiling helps to: understand system data structures and values; build the long-term data model; build the data dictionary/catalog to enable data subject access and data discovery; define data extraction and integration
   • The data catalog is an enabler of data virtualisation
6. Importance Of Data Profiling
   • Data profiling is a central activity that is key to downstream and long-term data usability, with an impact on Data Quality, Data Lineage/Data Provenance and Master Data Management
     − Contributes to and ensures data quality
     − Allows tracking of data lineage: data lineage and data provenance involve tracking data origins, what happens to the data and how it flows between systems over time; data lineage provides data visibility and simplifies tracing data errors that may occur in data reporting, visualisation and analytics
     − Enables the implementation of Master Data Management
7. Data Profiling Toolset Options – Partial List
   • Large number of data profiling tool options
   • You can investigate these tools to understand which are the most suitable and which functions are important prior to any formal tool selection process
     − Download and use trial versions of commercial tools
   • This work will require resources
   • Free/Open Source Tools: Aggregate Profiler, Quadient DataCleaner, Talend Open Studio
   • Commercial Tools: Atlan, Collibra Data Stewardship Manager, IBM InfoSphere Information Analyser, Informatica Data Explorer, Melissa Data Profiler, Oracle Enterprise Data Quality, SAP Business Objects Data Services (BODS), SAS DataFlux, TIBCO Clarity, WinPure
8. Data Profiling Stages
   • Data Access and Retrieval – defining the data sources to be profiled
   • Performing the Profiling – working through the programme of data profiling activities
   • Understanding and Interpreting the Results – collating, documenting and using the results
9. Layers Of Data Profiling Activities
   • Profiling starts with individual data fields/columns and then extends outwards to tables/files, then data stores/databases, to the upstream data sources and downstream data targets and finally the entire set of organisation data entities
   • Layers: Individual Data Fields → Individual Data Structures (Tables) → Data Store → Data Sources and Targets → Organisation Data Landscape
10. Data Profiling Across The Organisation Data Landscape (diagram: data profiling activities applied across the interconnected data entities of the organisation data landscape)
11. Data Profiling Across The Organisation Data Landscape
   • Data profiling activities are normally performed on single data entities
   • The organisation data landscape consists of multiple, generally heterogeneous, loosely interconnected data entities between which data moves
   • Data breathes life into the organisation's solution landscape
   • One profiled data entity can take its data from a number of upstream data sources and in turn be the source for a number of downstream data targets
   • Profiling may involve tracing data lineage across a number of data entities to create an end-to-end data provenance
12. Individual Data Profiling Exercise Can Leak Into Other Data Domains (diagram: a core data profiling activity linked to upstream and downstream data profiling activities)
13. Data Profiling Activities
   • Individual Field Analysis – Data Type, Length, Input Validation, Constraints; Number and Count of Values, Null/Missing Values, Maxima, Minima, Ranges, Distributions; Data Categories, Values, Data Dictionaries, Reference Sources; Data Value Patterns
   • Data Structures – Data Aggregations; Keys; Data Indexes; Triggers
   • Inter Field Linkages, Relationships, Correlations and Dependencies – Unique Combinations of Field Values; Functional Field Dependencies; Inclusion Dependencies; Cross Field Inconsistent Values
   • Data Completeness, Consistency and Accuracy – Missing and Incomplete Series Values and Gaps; Inconsistent Data Values; Inaccurate Data Values; Duplicate Values; Distribution and Occurrence Checking
   • Data Context – Data Sources; Data Processing and Transformation, Business Rules; Data Description and Documentation; Metadata Definition and Creation; Data Targets and Usage; Data Criticality; Data Security
   • Data Statistics – Data Capacity Statistics; Data Usage Statistics; Data Update Statistics; Data Growth Statistics; Data Processing Statistics; Data Overheads; Data Audit Logging
   • Data Infrastructure – Data Storage Infrastructure; Data Locations; Data Processing Infrastructure
   • Data Operations – Backup and Recovery; Replication; Availability and Continuity; Data Maintenance and Housekeeping Activities; Service Levels; Data Incident History
   • Data Technologies – Data Integration; Data Storage; Data Access
   • Problem Identification and Remediation – Identify Data Problems; Identify Remediation Activities
14. Data Profiling Activities
   • This represents a set of data profiling tasks to create a complete view of the data contents of a data entity
   • This allows a realistic programme of work to be defined for completing the data profiling activity
     − Resource requirements can be quantified
     − Duration can be estimated
     − Informed decisions can be made on what activities to include or exclude
15. Data Profiling – Individual Field Analysis
   • Analyse individual data fields or columns
     − Data Type, Length, Input Validation, Constraints – classify the field formats
     − Number and Count of Values, Null/Missing Values, Maxima, Minima, Ranges, Distributions – analyse and document the field values and determine any errors and inconsistencies
     − Data Categories, Values, Data Dictionaries, Reference Sources – identify lists of values used in fields and their sources
     − Data Value Patterns – seek to identify patterns in field values
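One common way to find data value patterns is to generalise each value into a character-class "shape" and count the shapes; divergent shapes flag inconsistent formats. A sketch, with invented sample values:

```python
# Field value pattern profiling: A = letter, 9 = digit, others kept.
from collections import Counter

def value_pattern(value: str) -> str:
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in value
    )

phones = ["087-1234567", "087 1234567", "+353871234567"]
print(Counter(value_pattern(v) for v in phones))
# Counter({'999-9999999': 1, '999 9999999': 1, '+999999999999': 1})
```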
16. Data Profiling – Data Structures
   • Analyse data structures – tables or files
     − Data Aggregations – analyse data structures: number of fields/columns, frequencies of values across lines/rows
     − Keys – identify data structure keys, their values, frequencies, relevance and usefulness for data access
     − Data Indexes – analyse data structure indexes, their values and their usefulness for data retrieval
     − Triggers – determine if triggers have been defined for fields and analyse their purpose, frequency, efficiency and utility
17. Data Profiling – Inter Field Linkages, Relationships, Correlations and Dependencies
   • Identify relationships between fields/columns of data structures/tables
   • Relationship and dependency identification can be complex because of data volumes and the large number of data values and combinations
     − Unique Combinations of Field Values – identify combinations of fields/columns that uniquely identify lines/rows
     − Functional Field Dependencies – identify circumstances where one field/column value affects others
     − Inclusion Dependencies – identify where some field/column values are contained in others (such as foreign keys)
     − Cross Field Inconsistent Values – identify field/column values across separate data structures that are inconsistent
18. Data Profiling – Inter Field Linkages, Relationships, Correlations and Dependencies
   • There are many algorithms that can be used to simplify the activity of identifying cross-field dependencies; these are frequently included in data profiling tools
     − Unique Combinations of Field Values: DUCC, GORDIAN, HCA, HyUCC, SWAN
     − Functional Field Dependencies: DEP-MINER, DFD, FDEP, FDMINE, FASTFDs, FUN, HyFD
     − Inclusion Dependencies: B&B, BINDER, CLIM, DeMARCH, MIND, MIND2, S-INDD, SPIDER, ZIGZAG
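To make the problem concrete, here is a brute-force baseline for discovering minimal unique column combinations (UCCs). It is purely illustrative of what the listed algorithms solve far more efficiently: this version is exponential in the number of columns, so it is only usable on small tables.

```python
# Naive minimal-UCC discovery: test every column combination for uniqueness.
from itertools import combinations
import pandas as pd

def minimal_uccs(df: pd.DataFrame):
    found = []
    for size in range(1, len(df.columns) + 1):
        for combo in combinations(df.columns, size):
            # skip supersets of combinations already known to be unique
            if any(set(ucc).issubset(combo) for ucc in found):
                continue
            if not df.duplicated(subset=list(combo)).any():
                found.append(combo)
    return found

df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "a"], "dob": ["x", "y", "y"]})
print(minimal_uccs(df))  # [('id',), ('name', 'dob')]
```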
19. Data Profiling – Data Completeness, Consistency and Accuracy
   • Analyse data within data structures to identify any gaps and inaccuracies
     − Missing and Incomplete Series Values and Gaps – determine any missing values in data series
     − Inconsistent Data Values – examine data values for inconsistencies
     − Inaccurate Data Values – examine data values for inaccuracy
     − Duplicate Values – identify potential duplicate values
     − Distribution and Occurrence Checking – create and analyse data value distributions
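Two of these checks are easy to sketch with pandas: gaps in an expected daily series and exact duplicate rows. The data below is invented.

```python
# Simple completeness checks: series gaps and duplicate rows.
import pandas as pd

readings = pd.DataFrame({
    "day": pd.to_datetime(["2021-05-01", "2021-05-02", "2021-05-04"]),
    "value": [10.0, 11.5, 9.8],
})

expected = pd.date_range(readings["day"].min(), readings["day"].max(), freq="D")
missing_days = expected.difference(readings["day"])
print("missing days:", list(missing_days))            # 2021-05-03
print("duplicate rows:", readings.duplicated().sum())  # 0
```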
20. Data Profiling – Data Context
   • Analyse the wider context of the data being profiled
     − Data Sources – identify the sources of the data (and their sources)
     − Data Processing and Transformation, Business Rules – determine how the data is created from its sources
     − Data Description and Documentation – describe and document the data
     − Metadata Definition and Creation – identify any existing metadata and create/update it
     − Data Targets and Usage – identify how the data is used by downstream targets and activities
     − Data Criticality – identify the importance and criticality of the data to business operations
     − Data Security – identify the current and required data security and access control requirements
21. Data Profiling – Data Statistics
   • Collect and analyse statistics on data
     − Data Capacity Statistics – collect and analyse the volumes of data being stored in the structures within the data entity
     − Data Usage Statistics – collect and analyse the rate of usage of data
     − Data Update Statistics – collect and analyse the rate, frequency and extent of data changes
     − Data Growth Statistics – collect and analyse the current and projected rates of growth of data volumes and data usage
     − Data Processing Statistics – collect and analyse data processing statistics, such as time to update
     − Data Overheads – collect and analyse data resource overheads associated with activities such as indexes and log shipping
     − Data Audit Logging – collect and analyse details on logging configuration and on data activity and usage
22. Data Profiling – Data Infrastructure
   • Analyse the underlying data infrastructure, including data service providers
     − Data Storage Infrastructure – document the current data storage infrastructure and platforms
     − Data Locations – document the data storage locations
     − Data Processing Infrastructure – document the infrastructure and platforms used to process data, including any performance and throughput bottlenecks
23. Data Profiling – Data Operations
   • Analyse current data operations activities, processes and technologies being used
     − Backup and Recovery – document data entity backup and recovery, including any testing and validation of processes
     − Replication – document data entity replication to other locations, including any testing and validation of processes
     − Availability and Continuity – document actual and desired data availability and continuity of access
     − Data Maintenance and Housekeeping Activities – document processes and activities relating to the maintenance and housekeeping of the data entity
     − Service Levels – document actual and desired data service levels across access and usage
     − Data Incident History – analyse service and incident history relating to the data entity, including frequency, severity, impact, time to resolve and the impact on overall data reliability
24. Data Profiling – Data Technologies
   • Analyse the technologies in use for the data being profiled
     − Data Integration – document and analyse data integration technologies
     − Data Storage – document and analyse data storage technologies
     − Data Access – document and analyse data access technologies
25. Data Profiling – Problem Identification and Remediation
   • Collate information on any problems and issues identified during the data profiling activities
     − Identify Data Problems – document and analyse the problems and issues
     − Identify Remediation Activities – identify remediation activities and define a programme of work
26. Data Profiling Complexity
   • Do not underestimate the complexity, effort and resources required for data profiling
   • A product can make the task easier but it is not a panacea
   • Data profiling can be a continuous activity as data changes and the target data catalog needs to be maintained and updated
27. Data Catalog
   • A set of information (metadata) containing details on an organisation's information resources – datasets
   • A data catalog can be a static or semi-static data structure created and maintained manually
   • Metadata is structured, consistent and indexed for fast and easy access and use
   • Contains descriptions of data resources
   • Enables user self-service data discovery and usage
   • Provides data discovery tools and facilities
   • A data catalog assists with implementing FAIR (Findable, Accessible, Interoperable, Reusable) data
     − Findable – details on data available on specific topics and subjects can be found easily and quickly
     − Accessible – the underlying data can be accessed
     − Interoperable – metadata ensures data can be aggregated and integrated across data types
     − Reusable – detailed metadata ensures data can be reused in the future
28. FAIR (Findable, Accessible, Interoperable, Reusable)
   • https://fairsharing.org/ – sample data collections
   • https://www.go-fair.org/ – implementation of the FAIR data principles – https://www.go-fair.org/fair-principles/
   • https://www.schema.org/ – contains sample metadata schemas
   • Strong academic focus but the principles can be applied elsewhere
29. Data Catalog Functionality Complexity
   • Levels of increasing catalog sophistication:
     − Registry – simple registry of data sources with links to their location and access mechanisms
     − Metadata Content – contains descriptions of the contents of the data sources
     − Structured and Processable Metadata – metadata is held in a structured and queryable format
     − Data Relationships – holds details on metadata and data concepts/themes with relationships between data sources
     − Content and Meaning Relationships – semantic mappings (visual representation of linkages) and relationships among domains of different datasets
   • Data catalogs can be simple or complex
   • Greater complexity requires more effort and the use of tools
   • Greater complexity ensures greater data usability and usefulness
   • A catalog can be constructed (semi-)automatically using data profiling tools
   • The data catalog must be constantly updated as data changes
30. Data Catalogs, Master Data Management, Data Profiling And Data Quality Relationships
   • Data Catalog – structured information about data sources, contents and access methods
   • Master Data Management – layer above operational systems dynamically linking data together
   • Data Profiling – discovery and documentation of data sources, types, dictionaries, values, relationships and usage
   • Data Quality – defining, monitoring and improving data quality: accuracy, cleansing, consistency and fitness for use
   • Relationships: MDM operationalises the data catalog; quality underpins the data catalog; MDM ensures data quality; data profiling is necessary to build a data catalog; MDM tools can automate data profiling
31. Data Catalog Vocabulary (DCAT)
   • See https://www.w3.org/TR/vocab-dcat-2/
   • Resource Description Framework (RDF) metadata data model
   • DCAT is a standard for describing datasets in a data catalog
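A minimal sketch of a DCAT dataset description built with rdflib, which bundles the DCAT and DCTERMS namespaces; the dataset URI and property values are illustrative assumptions.

```python
# Build and serialise a tiny DCAT description of one dataset.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCAT, DCTERMS, RDF

g = Graph()
ds = URIRef("https://example.org/dataset/customers")
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCTERMS.title, Literal("Customer Master Dataset")))
g.add((ds, DCTERMS.description, Literal("Profiled and cleansed customer records")))
g.add((ds, DCAT.keyword, Literal("customer")))
print(g.serialize(format="turtle"))
```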
32. Related Concepts
   • Business Glossary – defines terms and concepts across a business domain, providing an authoritative source for business operations
   • Data Dictionary – a collection of names, definitions and attributes about data items that are being used or stored in a database
33. Data Catalog Tools
   • Many commercial data catalog tools – many overlap with master data management
   • Open source options
     − CKAN – https://ckan.org/
     − Dataverse – https://dataverse.org/
     − Invenio – https://inveniosoftware.org/
     − QuiltData – https://quiltdata.com/
     − Zenodo – https://zenodo.org/
     − Kylo – https://kylo.io/
   • Can be used to test the concept before investing in commercial tools
   • Can also use the trial version of Azure Data Catalog – https://docs.microsoft.com/en-us/azure/data-catalog/overview
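As one example of programmatic catalog access, CKAN exposes a search endpoint through its action API. A sketch against the public demo instance; substitute your own CKAN endpoint and query.

```python
# Search a CKAN catalog and print matching dataset titles.
import requests

resp = requests.get(
    "https://demo.ckan.org/api/3/action/package_search",
    params={"q": "health", "rows": 5},
    timeout=10,
)
for pkg in resp.json()["result"]["results"]:
    print(pkg["title"])
```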
34. Metadata
   • Data that provides information about other data resources, enabling relevant data to be discovered, understood and managed reliably and consistently
   • There are various classifications of metadata types
35. Possible Metadata Structure And Organisation – types of metadata:
   • Descriptive – information about the data resource contained in a set of metadata fields; language; how data can be discovered
   • Business – what the data is, its sources, meaning and relationships with other data; location; ownership and authorship
   • Structural – how the data is organised and how versions are maintained; formats, contents, dictionaries
   • Administrative/Process – how the data should be managed and administered through its lifecycle stages; who can perform what operations on the metadata; security and access restrictions and rights; data preservation and retention; legal constraints and compliance requirements
   • Statistical – information on actual data creation and usage and other volumetrics
   • Reference – sets of values for structured metadata fields
   • Content – automatically generated (unstructured) metadata from content
   • Technical – infrastructural requirements; exchange and interface requirements, interoperability; API requirements and usage
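One possible in-code shape for a metadata record covering some of these types, as a sketch only; the field names are illustrative, not a standard schema.

```python
# A toy metadata record grouping fields by the types listed above.
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    title: str                    # descriptive
    description: str              # descriptive
    owner: str                    # business
    source_system: str            # business
    data_format: str = "csv"      # structural
    retention_years: int = 7      # administrative/process
    access_roles: list = field(default_factory=list)  # administrative/process

record = DatasetMetadata(
    title="Customer Master",
    description="Golden customer records",
    owner="Data Office",
    source_system="CRM",
    access_roles=["data_steward", "analyst"],
)
print(record)
```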
36. Metadata Harmonisation
   • Metadata harmonisation can mean:
     1. The ability of interacting data systems to exchange their individual sets of metadata (which may comply with different metadata standards/approaches/schemas) and to consistently and coherently interpret and understand the exchanged metadata
     2. The conversion of existing metadata held in different systems to a common standard
   • Harmonised metadata makes finding and comparing information easier
37. Key Metadata Harmonisation Principles
   • Evaluation – source and target metadata structures/schemas and the underlying data should be profiled before any target metadata schema design work starts
   • Matching – match existing metadata structures, involving extraction and analysis of data from source systems
   • Transformation – map the source schemas and geometry to the common target schema
   • Validation – assess the conformance of the metadata
   • Publication – make the transformed metadata schema available
   • Management – ongoing management, administration and maintenance
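A sketch of the Matching and Transformation steps: a crosswalk maps each source system's metadata field names onto a common target schema. The system and field names here are illustrative assumptions.

```python
# Crosswalk-driven metadata harmonisation onto a common target schema.
CROSSWALK = {
    "crm": {"name": "title", "desc": "description", "owner_email": "owner"},
    "erp": {"ds_title": "title", "ds_notes": "description", "steward": "owner"},
}

def harmonise(source: str, record: dict) -> dict:
    mapping = CROSSWALK[source]
    return {target: record[src] for src, target in mapping.items() if src in record}

print(harmonise("erp", {"ds_title": "Suppliers", "ds_notes": "Vendor list"}))
# {'title': 'Suppliers', 'description': 'Vendor list'}
```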
38. Metadata Concerns
   • No consistent schema and nomenclature being used
   • Each system will maintain different sets of metadata
   • No consistent set of values (vocabulary/dictionary/code lists) for metadata fields
   • Difficult to perform reliable comparisons across metadata
39. Metadata – Data Catalog
   • A set of information (metadata) containing details on an organisation's information resources – datasets
   • A data catalog can be a static or semi-static data structure created and maintained manually
   • Metadata is structured, consistent and indexed for fast and easy access and use
   • Contains descriptions of data resources
   • Enables user self-service data discovery and usage
   • Provides data discovery tools and facilities
   • A data catalog assists with implementing FAIR (Findable, Accessible, Interoperable, Reusable) data
     − Findable – details on data available on specific topics and subjects can be found easily and quickly
     − Accessible – the underlying data can be accessed
     − Interoperable – metadata ensures data can be aggregated and integrated across data types
     − Reusable – detailed metadata ensures data can be reused in the future
40. Scope Of Wider Data Management
   • Data Management encompasses: Data Governance; Data Architecture Management; Data Development; Data Operations Management; Data Security Management; Data Quality Management; Data Integration Management; Reference and Master Data Management; Data Warehousing and Business Intelligence Management; Document and Content Management; Metadata Management
41. Reference And Master Data Management
   • Reference and Master Data Management is the ongoing reconciliation and maintenance of reference data and master data
     − Reference Data Management is control over defined domain values (also known as vocabularies), including control over standardised terms, code values and other unique identifiers, business definitions for each value, business relationships within and across domain value lists, and the consistent, shared use of accurate, timely and relevant reference data values to classify and categorise data
     − Master Data Management is control over master data values to enable consistent, shared, contextual use across systems of the most accurate, timely and relevant version of the truth about essential business entities
   • Reference data and master data provide the context for transaction data
42. Reference and Master Data Management – Definition and Goals
   • Definition
     − Planning, implementation and control activities to ensure consistency with a golden version of contextual data values
   • Goals
     − Provide an authoritative source of reconciled, high-quality master and reference data
     − Lower cost and complexity through reuse and leverage of standards
     − Support business intelligence and information integration efforts
43. Reference and Master Data Management – Context
   • Inputs: Business Drivers; Data Requirements; Policy and Regulations; Standards; Code Sets; Master Data; Transactional Data
   • Suppliers: Steering Committees; Business Data Stewards; Subject Matter Experts; Data Consumers; Standards Organisations; Data Providers
   • Tools: Reference Data Management Applications; Master Data Management Applications; Data Modeling Tools; Process Modeling Tools; Metadata Repositories; Data Profiling Tools; Data Cleansing Tools; Data Integration Tools; Business Process and Rule Engines; Change Management Tools
   • Participants: Data Stewards; Subject Matter Experts; Data Architects; Data Analysts; Application Architects; Data Governance Council; Data Providers; Other IT Professionals
   • Primary Deliverables: Master and Reference Data Requirements; Data Models and Documentation; Reliable Reference and Master Data; Golden Record Data Lineage; Data Quality Metrics and Reports; Data Cleansing Services
   • Metrics: Reference and Master Data Quality; Change Activity; Issues, Costs, Volume; Use and Re-Use; Availability; Data Steward Coverage
   • Consumers: Application Users; BI and Reporting Users; Application Developers and Architects; Data Integration Developers and Architects; BI Developers and Architects; Vendors, Customers and Partners
44. Reference And Master Data Management – Principles
   • Shared reference and master data belongs to the organisation, not to a particular application or department
   • Reference and master data management is an ongoing data quality improvement programme; its goals cannot be achieved by one project alone
   • Business data stewards are the authorities accountable for controlling reference data values; business data stewards work with data professionals to improve the quality of reference and master data
   • Golden data values represent the organisation's best efforts at determining the most accurate, current and relevant data values for contextual use; new data may prove earlier assumptions to be false; therefore, apply matching rules with caution and ensure that any changes that are made are reversible
   • Replicate master data values only from the database of record
   • Request, communicate and, in some cases, approve changes to reference data values before implementation
45. Reference Data
   • Reference data is data used to classify or categorise other data
   • Business rules usually dictate that reference data values conform to one of several allowed values
   • In all organisations, reference data exists in virtually every database
   • Reference tables link via foreign keys into other relational database tables, and the referential integrity functions within the database management system ensure only valid values from the reference tables are used in other tables
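Where referential integrity is not enforced by the DBMS, the same check can be run as a profiling step. A sketch with invented data:

```python
# Flag transaction rows whose code is absent from the reference table.
import pandas as pd

country_ref = pd.DataFrame({"code": ["IE", "GB", "FR"]})
orders = pd.DataFrame({"order_id": [1, 2, 3], "country": ["IE", "XX", "FR"]})

invalid = orders[~orders["country"].isin(country_ref["code"])]
print(invalid)  # order_id 2, unknown code 'XX'
```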
46. Master Data
   • Master data is data about the business entities that provide context for business transactions
   • Master data is the authoritative, most accurate data available about key business entities, used to establish the context for transactional data
   • Master data values are considered golden
   • Master Data Management is the process of defining and maintaining how master data will be created, integrated, maintained and used throughout the enterprise
47. Master Data Challenges
   • What are the important roles, organisations, places and things referenced repeatedly?
   • What data is describing the same person, organisation, place or thing?
   • Where is this data stored? What is the source for the data?
   • Which data is more accurate? Which data source is more reliable and credible? Which data is most current?
   • What data is relevant for specific needs? How do these needs overlap or conflict?
   • What data from multiple sources can be integrated to create a more complete view and provide a more comprehensive understanding of the person, organisation, place or thing?
   • What business rules can be established to automate master data quality improvement by accurately matching and merging data about the same person, organisation, place or thing?
   • How do we identify and restore data that was inappropriately matched and merged?
   • How do we provide our golden data values to other systems across the enterprise?
   • How do we identify where and when data other than the golden values is used?
48. Understand Reference And Master Data Integration Needs
   • Reference and master data requirements are relatively easy to discover and understand for a single application
   • It is potentially much more difficult to develop an understanding of these needs across applications, especially across the entire organisation
   • Analysing the root causes of a data quality problem usually uncovers requirements for reference and master data integration
   • Organisations that have successfully managed reference and master data have typically focused on one subject area at a time
     − Analyse all occurrences of a few business entities, across all physical databases and for differing usage patterns
49. Define and Maintain the Data Integration Architecture
   • Effective data integration architecture controls the shared access, replication and flow of data to ensure data quality and consistency, particularly for reference and master data
   • Without a data integration architecture, local reference and master data management occurs in application silos, inevitably resulting in redundant and inconsistent data
   • The selected data integration architecture should also provide common data integration services
     − Change request processing, including review and approval
     − Data quality checks on externally acquired reference and master data
     − Consistent application of data quality rules and matching rules
     − Consistent patterns of processing
     − Consistent metadata about mappings, transformations, programs and jobs
     − Consistent audit, error resolution and performance monitoring data
     − A consistent approach to replicating data
   • Establishing master data standards can be a time-consuming task as it may involve multiple stakeholders
   • Apply the same data standards, regardless of integration technology, to enable effective standardisation, sharing and distribution of reference and master data
  50. 50. Data Integration Services Architecture May 3, 2021 50 Data Quality Management Metadata Management Integration Metadata Job Flow and Statistics Data Acquisition, File Management and Audit Replication Management Data Standardisation Cleansing and Matching Business Metadata Source Data Archives Rules Errors Staging Reconciled Master Data Subscriptions
51. Implement Reference And Master Data Management Solutions
   • Reference and master data management solutions are complex
   • Given the variety, complexity and instability of requirements, no single solution or implementation project is likely to meet all reference and master data management needs
   • Organisations should expect to implement reference and master data management solutions iteratively and incrementally through several related projects and phases
52. Define And Maintain Match Rules
   • Matching, merging and linking of data from multiple systems about the same person, group, place or thing is a major master data management challenge
   • Matching attempts to remove redundancy, to improve data quality and to provide information that is more comprehensive
   • Data matching is performed by applying inference rules
     − Duplicate identification match rules focus on a specific set of fields that uniquely identify an entity and identify merge opportunities without taking automatic action
     − Match-merge rules match records and merge the data from these records into a single, unified, reconciled and comprehensive record
     − Match-link rules identify and cross-reference records that appear to relate to a master record without updating the content of the cross-referenced record
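A toy match-link rule using stdlib fuzzy matching: link a candidate record to the master records it appears to refer to, without merging or modifying either side. Real MDM tools use much richer scoring; the threshold and fields are illustrative assumptions.

```python
# Minimal match-link: link candidates to masters above a similarity threshold.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_link(candidate: dict, masters: list, threshold: float = 0.85) -> list:
    # Return ids of apparently-related master records; no data is changed.
    return [m["id"] for m in masters
            if similarity(candidate["name"], m["name"]) >= threshold]

masters = [{"id": 1, "name": "Acme Ltd"}, {"id": 2, "name": "Beta plc"}]
print(match_link({"name": "ACME Ltd."}, masters))  # [1]
```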
53. Vocabulary Management And Reference Data
   • A vocabulary is a collection of terms/concepts and their relationships
   • Vocabulary management is defining, sourcing, importing and maintaining a vocabulary and its associated reference data
     − See ANSI/NISO Z39.19 – Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies – http://www.niso.org/kst/reports/standards?step=2&gid=&project_key=7cc9b583cb5a62e8c15d3099e0bb46bbae9cf38a
   • Vocabulary management requires the identification of the standard list of preferred terms and their synonyms
   • Vocabulary management requires data governance, enabling data stewards to assess stakeholder needs
54. Vocabulary Management And Reference Data
   • Key questions to ask to enable vocabulary management
     − What information concepts (data attributes) will this vocabulary support?
     − Who is the audience for this vocabulary? What processes do they support, and what roles do they play?
     − Why is the vocabulary needed? Will it support applications, content management, analytics, and so on?
     − Who identifies and approves the preferred vocabulary and vocabulary terms?
     − What are the current vocabularies different groups use to classify this information? Where are they located? How were they created? Who are their subject matter experts? Are there any security or privacy concerns for any of them?
     − Are there existing standards that can be leveraged to fulfil this need? Are there concerns about using an external standard vs. an internal one? How frequently is the standard updated and what is the degree of change in each update? Are standards accessible in an easy-to-import/maintain format in a cost-efficient manner?
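At its simplest, a controlled vocabulary is a set of preferred terms with synonyms, plus a lookup that normalises free-text input onto the preferred term. A sketch with invented terms:

```python
# Tiny controlled vocabulary: preferred terms and their synonyms.
VOCAB = {
    "customer": {"client", "account holder", "buyer"},
    "supplier": {"vendor", "provider"},
}

def preferred_term(term: str):
    t = term.strip().lower()
    for preferred, synonyms in VOCAB.items():
        if t == preferred or t in synonyms:
            return preferred
    return None  # unknown term: a candidate for steward review

print(preferred_term("Vendor"))  # 'supplier'
```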
55. Defining Golden Master Data Values
   • Golden data values are the data values thought to be the most accurate, current and relevant for shared, consistent use across applications
   • Determine golden values by analysing data quality, applying data quality rules and matching rules, and incorporating data quality controls into the applications that acquire, create and update data
   • Establish data quality measurements to set expectations, measure improvements and help identify root causes of data quality problems
   • Assess data quality through a combination of data profiling activities and verification of adherence to business rules
   • Once the data is standardised and cleansed, the next step is to attempt reconciliation of redundant data through the application of matching rules
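One common survivorship rule for deriving a golden value is to take the most recently updated non-null value across source records. A sketch; the rule, systems and field are illustrative assumptions.

```python
# Most-recent-non-null survivorship rule for a single attribute.
from datetime import date

records = [
    {"system": "CRM", "email": "a@old.example",  "updated": date(2021, 3, 1)},
    {"system": "ERP", "email": None,             "updated": date(2021, 4, 20)},
    {"system": "Web", "email": "a@corp.example", "updated": date(2021, 4, 2)},
]

candidates = [r for r in records if r["email"] is not None]
golden = max(candidates, key=lambda r: r["updated"])
print(golden["email"])  # a@corp.example (latest non-null value)
```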
56. Define And Maintain Hierarchies And Affiliations
   • Vocabularies and their associated reference data sets are often more than lists of preferred terms and their synonyms
   • Affiliation management is the establishment and maintenance of relationships between master data records
57. Plan And Implement Integration Of New Data Sources
   • Integrating new reference data sources involves
     − Receiving and responding to new data acquisition requests from different groups
     − Performing data quality assessment services using data cleansing and data profiling tools
     − Assessing data integration complexity and cost
     − Piloting the acquisition of data and its impact on match rules
     − Determining who will be responsible for data quality
     − Finalising data quality metrics
58. Replicate And Distribute Reference And Master Data
   • Reference and master data may be read directly from a database of record, or may be replicated from the database of record to other application databases for transaction processing and to data warehouses for business intelligence
   • Reference data most commonly appears as pick-list values in applications
   • Replication aids maintenance of referential integrity
59. Manage Changes To Reference And Master Data
   • Specific individuals have the role of a business data steward with the authority to create, update and retire reference data
   • Formally control changes to controlled vocabularies and their reference data sets
   • Carefully assess the impact of reference data changes
60. Data Governance And MDM Success Factors
   • Master Data Management will support the business by providing a strategy, governance policies and technologies for customer, product and entitlement information by following the Master Data Management Guiding Principles
     − Master data management will use (and where needed create) a "single version of the truth" for customer, product and asset entitlement master data, consolidated into a single master data system
     − Master data management will establish standard data definitions, and usage will be consistent, to simplify business processes across enterprise systems
     − Master data management systems and processes will be flexible and adaptable to handle domestic and global expansion, to support growth in both established and emerging markets
     − Master data management will adhere to a standards governance process to ensure key data elements are created, maintained, cleansed and converted to be syndicated across enterprise systems
     − Master data management will identify responsibilities and monitor accountability for customer, product and entitlement information
     − Master data management will facilitate cross-functional collaboration and manage continuous improvement of master data for the customer, product and entitlement domains
61. Data Governance is Not A Choice – It Is A Necessity
   • "We've got to stop having the 'who owns the data?' conversation."
   • "We can't do MDM if we don't formalise decision-making processes around our enterprise information."
   • "Fixing the data in a single system is pointless; we don't know what the rules are across our systems."
   • "Everyone agrees data quality is poor, but no one can agree on how to fix it."
   • "Are you kidding? We have multiple versions of the single-version-of-the-truth."
62. MDM Program Critical Success Factors
   • Strategy
     − Drive and promote alignment with corporate strategic initiatives and pillar-specific goals
     − Definition of criteria and core attributes that define domains and related objects
   • Solution
     − Alignment with corporate strategic initiatives and pillar-specific goals
     − Identification of "Quick Wins" that have measurable impact
     − Clear definition of metrics for measuring data improvement
     − Leading industry practices have been incorporated into the solution design
   • Governance
     − Executive ownership, and a Governance organisation that has been rationalised and established to address federated data management needs
     − Data quality is addressed at all points of processes, as well as customer and product lifecycles
   • End-to-end Roadmap
     − Prioritised programme roadmap for "Quick Wins"
     − Prioritised programme roadmap for CDM strategic initiatives
     − Fully vetted CBA for each roadmap item
     − "No Regrets" actions are rationalised and aligned with the strategic roadmap
63. More Information
   Alan McSweeney
   http://ie.linkedin.com/in/alanmcsweeney
   https://www.amazon.com/dp/1797567616
