Automated Metadata Management in Data Lake – A CI/CD Driven Approach

As data engineers, we are aware of the trade-offs between development speed, metadata governance, and schema evolution (or restriction) in a rapidly evolving organization. Our day-to-day activities involve adding, removing, and updating tables, protecting PII, and curating and exposing data to our consumers. While our data lake keeps growing exponentially, there is a matching increase in our downstream consumers. The struggle is to balance promoting metadata changes quickly against validating them robustly enough to keep downstream systems stable. In the relational world, DDL and DML changes can be managed through numerous options available for every kind of database, from the vendor or a third party. As engineers, we developed a tool that uses a centralized, Git-managed repository of data schemas in yml structure with CI/CD capabilities, which maintains the stability of our data lake and downstream systems.

In this presentation, Northwestern Mutual engineers will discuss how they designed and developed a new end-to-end, CI/CD-driven metadata management tool that makes introducing new tables and views, managing access requests, and similar tasks more robust, maintainable, and scalable, all by only checking in yml files. The tool can be used by people who have little or no knowledge of Spark.

Key focus will be:

  • The need for a metadata management tool in a data lake
  • Architecture and design of the tool
  • Maintaining information about databases, tables, and views (schema, owner, PII, description, etc.) in a simple-to-understand yml structure
  • Live demo of creating a new table with CI/CD promotion to production
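
As a rough illustration of keeping a table's schema, PII handling, and description in one config entry, the sketch below holds such an entry as a Python dict (the form a parsed yml file would take) and checks it before use. The key names `name`, `description`, `schema`, `encrypted_columns`, and `masked_columns` appear in the slides. The value shapes, the choice of which keys are required, the `validate_table_config` helper, and the sample `orders` table are assumptions made for illustration, not the tool's actual format.

```python
# Assumption: which keys are required vs. optional is a guess for this sketch.
REQUIRED_TABLE_KEYS = {"name", "description", "schema"}
OPTIONAL_TABLE_KEYS = {"encrypted_columns", "masked_columns"}


def validate_table_config(config):
    """Return a list of problems found in a table config; empty means valid."""
    problems = []
    present = set(config)
    for key in sorted(REQUIRED_TABLE_KEYS - present):
        problems.append(f"missing required key: {key}")
    for key in sorted(present - REQUIRED_TABLE_KEYS - OPTIONAL_TABLE_KEYS):
        problems.append(f"unknown key: {key}")
    # Assumed shape: schema is a list of {"name": ..., "type": ...} columns.
    columns = {col["name"] for col in config.get("schema", [])}
    # Every protected column must exist in the schema.
    for key in sorted(OPTIONAL_TABLE_KEYS):
        for col in config.get(key, []):
            if col not in columns:
                problems.append(f"{key} references unknown column: {col}")
    return problems


# Hypothetical table config, as it might look after parsing a yml file.
orders_table = {
    "name": "orders",
    "description": "One row per customer order.",
    "schema": [
        {"name": "order_id", "type": "bigint"},
        {"name": "customer_email", "type": "string"},
        {"name": "ssn", "type": "string"},
    ],
    "encrypted_columns": ["ssn"],
    "masked_columns": ["customer_email"],
}

print(validate_table_config(orders_table))
```

A check like this could run as an early pipeline stage, so that a malformed config fails review before anything is promoted to an environment.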

  1. Automated Metadata Management in Data Lake – A CI/CD Driven Approach. Keyuri Shah, Lead Engineer; Josh Reilly, Lead Engineer
  2. Agenda
     • Introduction
     • Need for Metadata Management
     • Architecture
     • Overview on Tool
     • Live Demo
  3. Need for Metadata Management
     • What is Metadata Management
     • Motivation for a config driven tool: Governance; Easy to Maintain
     • Development Stack: Python; Gitlab CI
     • Use Cases: Enterprise Data Lake; Sharing Lake across different teams
  4. Config File Options
     • Database: name; owning_team; description; access (type: ad_group; <env>: AD_GRP_NAM); type: data
     • Tables: name; description; schema; encrypted_columns; masked_columns
     • Views: name; database; description; query
  5. Design
  6. Live Demo
     • Create a database/table/view config
     • Check in to Git
     • Run CI/CD pipeline to plan and apply to int
     • Verify Database/Table/View in Databricks
     • Update Schema & Run Pipeline
  7. Feedback. Your feedback is important to us. Don’t forget to rate and review the sessions. Keyuri Shah: https://www.linkedin.com/in/keyuri-shah Josh Reilly: https://www.linkedin.com/in/josh-reilly-51052996/
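
The "plan and apply" step in the live demo is not described in detail in the slides. One way to picture a plan step is as a diff between the columns a table currently has and the columns its config asks for, as in the minimal sketch below. The `plan_table_changes` function, the `{column: type}` mapping shape, the `sales.orders` table, and the decision to flag type changes and removals for review rather than generate DDL for them are all assumptions for illustration, not the presenters' implementation.

```python
def plan_table_changes(table_name, current_columns, desired_columns):
    """Compare two {column: type} mappings and return the planned actions.

    current_columns is what exists in the lake today (empty if the table
    does not exist yet); desired_columns is what the checked-in config asks for.
    """
    if not current_columns:
        cols = ", ".join(f"{name} {dtype}" for name, dtype in desired_columns.items())
        return [f"CREATE TABLE {table_name} ({cols})"]

    plan = []
    for name, dtype in desired_columns.items():
        if name not in current_columns:
            # New column in the config: safe, additive change.
            plan.append(f"ALTER TABLE {table_name} ADD COLUMNS ({name} {dtype})")
        elif current_columns[name] != dtype:
            # Type changes can break downstream readers, so only flag them.
            plan.append(
                f"REVIEW: {table_name}.{name} type change "
                f"{current_columns[name]} -> {dtype}"
            )
    for name in current_columns:
        if name not in desired_columns:
            plan.append(f"REVIEW: {table_name}.{name} removed from config")
    return plan


# Hypothetical example: the config adds a "status" column to an existing table.
current = {"order_id": "bigint"}
desired = {"order_id": "bigint", "status": "string"}
for action in plan_table_changes("sales.orders", current, desired):
    print(action)
```

Separating the plan from the apply step lets a reviewer see the exact statements a merge would run before they reach an environment, which matches the balance between promotion speed and downstream stability described in the abstract.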
