Inside open metadata—the deep dive

In this session we take an in-depth look into the Apache Atlas open metadata and governance function.

Open metadata and governance is a moon-shot type of project to create a set of open APIs, types, and interchange protocols to allow all metadata repositories to share and exchange metadata. From this common base, it adds governance, discovery, and access frameworks to automate the collection, management, and use of metadata across an enterprise. The result is an enterprise catalog of data resources that are transparently assessed, governed, and used in order to deliver maximum value to the enterprise.

Apache Atlas is the reference implementation of the Open Metadata and Governance standards and framework (https://cwiki.apache.org/confluence/display/ATLAS/Open+Metadata+and+Governance). This function will enable an Apache Atlas server to synchronize and query metadata from any open metadata-compliant metadata repository.
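
The idea of querying metadata across several compliant repositories can be pictured with a small sketch. This is not the Atlas or OMRS API: the `Entity` and `Repository` classes, the `federated_find` function, and the repository names are hypothetical, chosen only to illustrate a query that fans out to every member of a cohort and merges the answers by unique identifier, preferring the copy held by the repository that owns each instance.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A metadata instance; `home` names the repository that owns it (assumed model)."""
    guid: str
    type_name: str
    name: str
    home: str


@dataclass
class Repository:
    """Hypothetical stand-in for one metadata repository in a cohort."""
    name: str
    entities: list = field(default_factory=list)

    def find_by_type(self, type_name):
        return [e for e in self.entities if e.type_name == type_name]


def federated_find(cohort, type_name):
    """Ask every repository in the cohort and merge the results by GUID."""
    merged = {}
    for repo in cohort:
        for entity in repo.find_by_type(type_name):
            # Keep the first copy seen, but let the home repository's copy win.
            if entity.guid not in merged or entity.home == repo.name:
                merged[entity.guid] = entity
    return sorted(merged.values(), key=lambda e: e.guid)


# Illustrative data only: "igc" holds a cached copy of an entity owned by "atlas".
atlas = Repository("atlas", [Entity("g1", "DataSet", "Employee Directory", "atlas")])
igc = Repository("igc", [
    Entity("g2", "DataSet", "Clinical Trials", "igc"),
    Entity("g1", "DataSet", "Employee Dir (cached)", "atlas"),
])

results = federated_find([igc, atlas], "DataSet")
for entity in results:
    print(entity.guid, entity.name, "home:", entity.home)
```

Merging by GUID is one simple way to avoid returning the same instance twice when a repository caches metadata that lives elsewhere; a real implementation would also need to handle versions, unavailable members, and type compatibility.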

In this session we will cover how Open Metadata and Governance works. This includes: (1) the key components in Atlas, (2) the different integration patterns and APIs that vendors can use to integrate their technology into the open metadata ecosystem, and (3) how common metadata use cases such as searching for data sets, managing security (through Atlas/Ranger integration), and automated metadata discovery work in the active ecosystem.
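
As a rough illustration of the security use case, the sketch below masks values in a record according to classifications attached to its columns. It does not use the Atlas or Ranger APIs, and the column names, classification labels, sample values, and the `mask_record` function are assumptions made for this example only.

```python
SENSITIVE = "Sensitive"

# Hypothetical column classifications, as a metadata catalog might record them.
column_classifications = {
    "EMPNAME": "Public",
    "JOBCODE": "Public",
    "SALARY": SENSITIVE,
}


def mask_record(record, classifications, mask_char="X"):
    """Return a copy of the record with sensitive values replaced by mask characters."""
    masked = {}
    for column, value in record.items():
        # Columns with no classification are treated as sensitive, to fail safe.
        label = classifications.get(column, SENSITIVE)
        if label == SENSITIVE:
            masked[column] = mask_char * len(str(value))
        else:
            masked[column] = value
    return masked


# Illustrative values only.
record = {"EMPNAME": "Callie Quartile", "JOBCODE": "Data Scientist", "SALARY": 56944}
shared = mask_record(record, column_classifications)
print(shared)
```

Driving the masking from the classification rather than from a hard-coded column list means that reclassifying a column in the catalog changes what is shared without changing the code.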

Speaker
Mandy Chessell, Distinguished Engineer, IBM

Published in: Technology

  1. Good analytics needs good data and that needs good metadata. Mandy Chessell CBE FREng CEng FBCS, Distinguished Engineer, Master Inventor, Analytics Chief Data Office, mandy_chessell@uk.ibm.com. 18th April 2018.
  2. Apache Atlas as an open innovation platform for metadata management and governance
      Agenda:
      • Why is metadata so important today?
      • What is the challenge?
      • Building an open ecosystem
      • Apache Atlas and the specifics
      • ODPi Data Governance PMC
      • Progress report and call to action
  3. The perils of reusing data … Callie Quartile, Data Scientist, uses (1) open data from the local government registrar and (2) data from the employee directory to (3) create a birthday card service for the company. Diagram labels: Open Data Site, Data Lake, Employee Directory.
  4. The perils of reusing data … “Happy Birthday.” “But it’s not my birthday.” Unfortunately, the obvious date in the registrar record was the registration-of-birth date, not the date of birth. Date of birth was not published in the open data. Callie needed better information about the open data to realise she had the wrong data.
  5. Metadata should bring as much information about the data sets to Callie’s data science as is known collectively by the organization.
      • Data Set Name: Employee Directory
      • Description: Core attributes describing all employees of OCO pharmaceuticals created from a daily extract from Kenexa.
      • Owner: Penny Payer
      • Status:
      • Last accessed: 6th May 2016
      • Records: 3488
      • Last Update: 1st May 2016
      • Contents: Structure …, Contents …, Lineage …
      • Columns shown: Name, Band, Job Title
      • Classification Ranges: Confidentiality: Public, Confidential, Sensitive. Confidence: Authoritative. Retention: Indefinitely.
      • Column: Band (tabs: Description, Characteristics, Lineage). Description: Position reference number for non-exempt employees. The value ranges from 01 to 06 where 01 is the most senior and 06 is the most junior. Type: String. Classification: Public.
  6. Different personas need different services
      • Callie Quartile, Data Scientist: find data; understand data; manage analytics models
      • Jules Keeper, Chief Data Officer: build data strategy; define governance program; monitor progress
  7. Different personas need different services
      • Faith Broker, HR and Privacy Officer: locate personal data; ensure protection of personal data; understand employee needs
      • Gary Geeke, IT: maintain “safe” IT infrastructure; build and deploy “good” APIs and services; locate and resolve issues fast
  8. Different personas need different services
      • Tanya Tidie, Clinical Trials Administrator: maintain accurate patient records; catalog clinical trials data; demonstrate good data management practices
      • Ivor Padlock, Chief Security Officer: understand risks to organization; set up protection; monitor for suspicious activity
  9. Scope of metadata for a data-driven organization: Glossary; Collaboration; Governance; Models and Reference Data; Metadata Discovery; Lineage; Data Assets; Base Types, Systems and Infrastructure.
  10. Curation. Three raw employee records with no field labels:
      00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
      00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
      00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
      Speech bubbles: “I wonder what this means” and “I know”.
  11. Scared to share. Faith Broker (Business Team) has been doing some simple analysis on the HR data of the company. She wants to share this data with Callie Quartile (Data Scientist) to do some detailed work. However, she does not want Callie to see the sensitive personal information in the record.
      Records as Faith sees them:
      00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
      00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
      00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
      Records as shared with Callie, with the sensitive fields masked:
      00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 XXXXX XXX 27 Code St Harlem NY 1 3
      00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 XXXXX XXX 27 Code St Harlem NY 1 3
      00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 XXXXX XXX 27 Code St Harlem NY 1 3
  12. Using glossary function for semantic processing
      • Structural metadata for a data store: EMPLOYEE RECORD with fields EMPNAME, EMPNO, JOBCODE, SALARY
      • Business metadata (glossary terms): Employee, Work Location, Annual Salary, Job Title, Employee Id, Employee Name, Hourly Pay Rate, Manager, Compensation Plan, Sensitive Data
      • Relationship labels in the diagram: HAS-A (six times), IS-A (three times)
      • Data: 00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
  13. Why do we need metadata?
      • Metadata enables data to be used outside of the application that created it.
          • Analytics and decision making
          • New business applications
          • Reporting and compliance
      • Metadata describes the format and content of data, allowing people to judge which data set to use for a new project.
          • Structure
          • Meaning
          • Origin
          • Valid values and quality
          • Usage and ownership
          • Regulations and classifications that apply
          • <more>
      • Metadata describes the business context and classification of data, allowing automated governance processes to operate.
  14. Today’s reality
      • Many data platforms do not have metadata support
      • Proprietary tools support a range of data sources and governance actions
          • No vendor supports everything you need, yet each assumes all tools come from its own suite
          • Each tool starts “empty”, requiring effort to populate metadata
          • Each tool operates as if it is the only tool
          • No integration/interoperability of metadata repositories from different vendors
      • Expensive efforts to create an enterprise data catalogue
  15. Today’s reality
  16. Manual metadata capture
  17. Automatic metadata capture
  18. What needs to change? Open and Unified Metadata
  19. A new manifesto for metadata and governance
      • Metadata management must be automated
      • Metadata management must become ubiquitous
      • Metadata must become open and remotely accessible
      • Metadata should be used to drive the governance of data
      The discovery, maintenance and use of metadata has to be an integral part of all tools that access, change and move information.
  20. Open metadata management ecosystem
      • Peer-to-peer network of repositories
      • Metadata stored and managed close to its source
      • Each repository/tool brings unique value
      • Open, extensible metadata structures for metadata exchange and federation, extending coverage of the types of resources that need to be described
      • Open source infrastructure sharing cost of development and maintenance between vendors
      • Support for open standards where available
      Diagram labels: Collaboration Space Metadata; Analytics Platform Metadata; Application Metadata; Cloud SaaS Platform Metadata; Hadoop Platform Metadata.
  21. Apache Atlas (http://atlas.apache.org/)
      • Apache Atlas has just graduated to become a top-level project.
      • It began as an incubator open source project on 5th May 2015 to deliver an open source governance capability focused primarily on the Hadoop platform.
      • Apache Atlas is designed to localize operational governance to the operating data platform, such as Hadoop.
      • At its heart is a type-agnostic metadata store that can be accessed through RESTful interfaces.
      We see Apache Atlas as the reference implementation for open metadata and governance, for vendors to pick up and use, or to test their integration against. Being open source allows all vendors to enrich and enhance the standard.
  22. Apache Atlas today
  23. Updates to Apache Atlas
      • Automation
          • Capture of metadata from data platforms, data movement engines and data protection engines
          • Exception management and stewardship
      • Business Value
          • Specialized services for key data roles such as CDO, Data Scientist, Developer, DevOps Operator, Asset Owner, Applications
      • Connectivity
          • Metadata Highway offering open metadata exchange, linking and federation between heterogeneous metadata repositories
  24. Taking guidance from existing metadata standards: well-defined; complementary; integrating; decoupled. Example: https://www.w3.org/TR/vocab-dcat/
  25. Instance representations in the graph
  26. Open metadata meta-types, types and instances
      • «entity» types shown: Referenceable, Asset, DataSet, DataStore, GlossaryTerm
      • Entity attributes shown: createTime : date, modifiedTime : date
      • «relationship» DataContentForDataSet: ends dataContent (*) and supportedDataSets (*)
      • «relationship» SemanticAssignment: ends assignedElements (*) and meaning (*); attributes description : string, expression : string, status : TermAssignmentStatus, confidence : int, steward : string, source : string
  27. Open metadata type model summary. Areas numbered 0 to 7 in the diagram: Base Types, Systems and Infrastructure (area 0); Glossary; Collaboration; Governance; Models and Reference Data; Metadata Discovery; Lineage; Data Assets.
  28. Open metadata type model summary (detail). Labels in the diagram, areas numbered 0 to 7:
      • Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics)
      • Governance Actions and Processes
      • Augmentation; Mapping; Implementation
      • Business Objects and Relationships, Taxonomies and Ontologies
      • Business Attributes
      • Organization
      • Teaming Metadata (people profiles, communities, projects, notebooks, …)
      • Models and Schemas
      • Physical Asset Descriptions (Data stores, APIs, models and components)
      • Asset Collections (Sets, Typed Sets, Type Organized Sets)
      • Information Views
      • Rights Management
      • Reference Data
      • Feedback Metadata (tags, comments, ratings, …)
      • Classification Schemes
      • Classification Strategy
      • Subject Area Definition
      • Campaigns and Projects
      • Rollout
      • Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …)
      • Augmentation; Instrument; Association
      • Information Process Instrumentation (design lineage)
      • O-DEF; O-BDL
      • Connectors; Basic Types, Infrastructure and Systems; Access
  29. More detail here: https://cwiki.apache.org/confluence/display/ATLAS/Building+out+the+Open+Metadata+Typesystem
  30. Metadata and governance digital platform. Open Metadata and Governance shown alongside: Reporting Platform; ETL Platform; Analytics Platform; Virtualization Platform; Governance Platform; Data Platform.
  31. Types of tools that may integrate with an open metadata repository
      • BI and visualization tools: locating data assets and related information about them; defining reports and publishing their metadata; viewing lineage
      • Data science tools: finding out about available data assets and managing user lineage of transformations and analytics models; may also manage metadata for analytics models
      • API developer tools: understanding the proper data structures and data meaning to use for APIs, plus additional governance requirements that the API must implement because of the data it exchanges
      • Counter-fraud tools: ad hoc analysis of logs and error reports; setting up rules
      • Curator/owner tools: managing the curation of assets, providing access, verifying use of assets, reviewing discovery results and exceptions, approving change requests
      • Glossary tools: for subject matter experts and information architects to share expertise about a particular subject area; may also define structures and related reference data
      • Enterprise architect tools: defining the data landscape and related systems
      • DevOps tools: conformance to policies and standards in development; metadata capture at deployment; validation of deployment platform requirements
      • Data integration engines: locating appropriate data and component assets; logging design lineage; logging operational lineage
      • Information virtualisation tools: locating appropriate data assets; building views and publishing them; adding design lineage; logging operational lineage
      • Governance tools: setting up and monitoring governance program, data quality, …
      • Stewardship tools: reviewing assigned exceptions, making data changes and requesting approval
      • Information security tools: setting up data access policies and enforcement
      • Auditor tools: viewing compliance reports and validating policies and policy implementations
  32. Open Metadata Access Services: Project Management; Community Profile; Asset Catalog; Stewardship Action; Information View; Governance Program; Information Process; Subject Area; Connected Asset; Discovery; Governance Engine; Information Protection; Developer; Data Platform; Asset Owner; Information Landscape; Data Science; DevOps; Asset Consumer; Information Infrastructure.
  33. OMAS service instance: supports both API calls and notifications.
  34. Inside the server: Open Metadata and Governance (OMAG) Server
      • Open Metadata Access Services (OMAS)
      • Open Metadata Repository Services (OMRS)
      • Connectors: OMRS Topic Connector; OMRS Cohort Registry Store Connector; OMRS Archive Connector; OMRS AuditLog Connector; OMRS Event Mapper Connector; OMRS Repository Connector
      • Server Configuration
      • Interfaces: OMAS REST APIs and Topics; OMAG Administration REST APIs; OMRS Repository REST APIs
  35. Inside the server: the same OMAG Server diagram, with these components shown in addition to the OMAS, connectors, Server Configuration and interfaces of the previous slide
      • Administration
      • Enterprise Repository Services
      • Local Repository Services
      • Cohort Services
  36. Integration patterns: https://cwiki.apache.org/confluence/display/ATLAS/Integrating+into+the+Open+Metadata+and+Governance+Ecosystem Diagram labels: IBM Information Governance Catalog; Apache Atlas.
  37. Caller Pattern
      • A metadata tool can access the consumer-specific APIs to work with metadata.
      • The Access Layer handles the calls to metadata repositories connected to the metadata highway.
  38. Native Pattern
      • The repository implements the open metadata and governance APIs natively.
      • Apache Atlas is a native implementation of the open metadata and governance APIs.
  39. Adapter Pattern
      • Simple components plug into a repository proxy to connect an existing metadata repository.
  40. Plug-in Pattern
      • Open Connector Framework (OCF): connectors to data, analytics, etc.
      • Open Discovery Framework (ODF): metadata discovery services
      • Governance Action Framework (GAF): stewardship services for triage and remediation of exceptions
  41. IBM Unified Governance
  42. Simple cohort. Cohort A; diagram labels: Chief Data Office, Data Lake, Systems of Record.
  43. Multiple cohorts. Cohort A and Cohort B; diagram labels: Chief Data Office, Data Lake, Systems of Record, Mobile Apps, Data Lake, Systems of Record, Marketing.
  44. First server
  45. Establishing contact
  46. Federated queries
  47. Caching metadata for availability and performance
  48. ODPi: co-creation with practitioners
      • Compliance assistance and certification for vendors
      • Subject matter experts sharing best practices and co-creating content packs
      https://github.com/odpi/data-governance
  49. How ODPi helps
      • Data Governance Professionals
          • Your governance program is based on established practices and definitions
          • Allows a broader range of tools in your organization
          • Automated governance processes protect and manage your data
      • Vendors
          • Your metadata offerings will deliver value faster as they tap into metadata collected by other vendors’ tools.
          • ODPi packages extend your metadata system’s and tools’ capabilities.
          • Conformance tests minimize your effort in being compliant with key standards and regulations.
          • Customers have increased confidence in your tools and services due to ODPi certification.
  50. Summary
      • Big data is creating new opportunities and requirements that need new types of systems. Data lakes are just one part of this story.
      • Metadata is critical to make the best use of this data for the widest range of scenarios.
      • Most organizations use tools and platforms from many vendors.
      • Open standards have had limited take-up.
      • Can we use open source to create a digital platform that allows vendors to take advantage of metadata from a broader ecosystem?
          • Open Metadata and Governance defines the standards
          • Apache Atlas provides the reference implementation
          • ODPi helps to build the ecosystem
  51. Call to action: how can you help?
      • Direct contribution to the Apache Atlas and/or ODPi Data Governance projects.
          • There are many features that still need to be developed.
      • Encouraging your vendors/partners and projects internal to your organization to embrace the Open Metadata and Governance standards, to grow the ecosystem of data and processing that is assured by metadata and governance capability.
  52. https://cwiki.apache.org/confluence/display/ATLAS/Atlas+Projects
  53. Questions?
