SlideShare a Scribd company logo
1 of 28
Download to read offline
Amundsen at Brex
Why Amundsen?
(why data catalog at all?
Documentation and discoverability
1. Data scientists and analysts found it difficult to learn what a table is or how and where its used
2. Documentation often doesn’t exist!
Governance
3. We have no unified way to tag and identify sensitive data/PII
The Problems
Documentation and discoverability
1. Data scientists and analysts found it difficult to learn what a table is or how and where its used
a. Single searchable source of documentation
b. Data catalogs can display dependencies between tables/dashboards
2. Documentation often doesn’t exist!
a. The catalog’s existence provides motivation to write documentation
b. The catalog can be used to track down popular but undocumented tables
Governance
3. We have no unified way to tag and identify sensitive data/PII
a. Provides a unified tag database for tags from all integrated data sources
b. Tags can be auto-generated by PII-detection tools, as well as manually added
Data catalog to the rescue
Amundsen is:
● 🙂 Simple - exactly the feature set we needed
● 🎉 Open source =
○ less expensive
○ highly customizable
○ can be integrated with anything
○ part of a community
We chose Amundsen because
❤ open source!
Implementing Amundsen
SSO with Okta
Custom
Looker
integration
K8s pods for each service, running behind a VPN
Airflow ETL DAG
runs once per
day
Neo4J 4.x
We decided not
to let people edit
descriptions
At least, for now)
We wanted documentation to be version
controlled, and we wanted to the code to be the
single source of truth.
Since people work in code a lot of the time, we felt
that it is important for Amundsen to work with code
documentation, not against it.
We intend to revisit this though!
Challenges we faced
Package
Requirements
Issues
We fixed these!
Amundsen Databuilder had a pretty strict
requirements.txt, making it difficult to:
1. Integrate into our existing Airflow codebase.
2. Combine it with other libraries
Eventually we unpinned most of the dependencies,
and contributed this upstream!
Neo4J 4.x
We’d love to fix this upstream!
We decided to use Neo4J 4.x, because:
1. Backups are easier
2. It would give us the opportunity to
potentially switch to a SAAS e.g. neo4j Aura
Amundsen does not support Neo4J 4.x, so we
made a few tweaks!
As a result, we’re currently using some forks.
Fun stuff we did!
Looker Integration
.LookML files in
Github
Models and Views;
SQL queries
Dashboards via
API
Dashboards
reference Models
and Views
Looker’s
Sources of
truth
Amundsen
?Git clone?
Looker Integration
.LookML files in
Github
Models and Views;
SQL queries
Dashboards via
API
Dashboards
reference Models
and Views
Looker’s
Sources of
truth
Amundsen
?Git clone?
Looker Integration
Looker data lineage
extraction code
Docker Image for
Artifact Builder
(in ECR)
Models/Views
In LookML repo
Amundsen Airflow DAG
Combines lookml view data
With API dashboard data
Lineage.json
(in S3)
CICD uses
docker image to
build lineage.json
Docker image
built in CICD
1. Build docker images which convert LookerML files to JSON lineage data
2. Use the Docker images in the source repository’s build pipeline to
create artifacts and upload them to S3
3. Consumers download the
artifacts at run-time, combine with
dashboard data from API
Dashboards from
Looker API
Looker Integration - Stage 1
Model/View files
(In looker Github repo)
Lineage.json
(in S3)
Parse LookML files and
build lineage graph
during CI.
Neo4J
Looker Integration - Stage 2
Model/View files
(In looker Github repo)
Lineage.json
(in S3)
Amundsen Airflow Pipeline
Combines lookml view data
With API dashboard data
Dashboards
(From Looker API)
Neo4J
Parse LookML files and
build lineage graph
during CI.
Query dashboards
using Looker SDK
Download lineage
data from S3
Looker Integration Data lineage -
Analysts love this!
Description
from looker
Looker Integration
Looker lineage
works in reverse
Custom Taggers
Documented/
Undocumented
tags
Category from
RBAC system
Custom Taggers
Customized Snowflake Integration
One query can pull all the metadata
we need from many tables across
multiple databases:
● Columns
● Last update time
● Documentation
● Table usage
What’s next for
Amundsen at Brex?
Collect user feedback
and usage statistics to
drive adoption
and documentation
Use Amundsen for PII
tagging
Contribute upstream 🤞
● Looker integration
● Neo4J 4.x support
● “Enriching Transformers”
● Useful taggers
Thanks for your time!
Any questions?

More Related Content

What's hot

Artificial Intelligence based Knowledge Management System - IBM Watson
Artificial Intelligence based Knowledge Management System - IBM WatsonArtificial Intelligence based Knowledge Management System - IBM Watson
Artificial Intelligence based Knowledge Management System - IBM WatsonThirdEye Data
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Fwdays
 
Exploring ChatGPT For Effective Teaching
Exploring ChatGPT For Effective TeachingExploring ChatGPT For Effective Teaching
Exploring ChatGPT For Effective TeachingAdicodeTechnologies
 
Training Week: Introduction to Neo4j
Training Week: Introduction to Neo4jTraining Week: Introduction to Neo4j
Training Week: Introduction to Neo4jNeo4j
 
ChatGPT_Cheatsheet_Costa.pdf
ChatGPT_Cheatsheet_Costa.pdfChatGPT_Cheatsheet_Costa.pdf
ChatGPT_Cheatsheet_Costa.pdfssuser3e5d3a
 
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Databricks
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Sri Ambati
 
ChatGPT Prompt Tips
ChatGPT Prompt TipsChatGPT Prompt Tips
ChatGPT Prompt TipsJohn Allan
 
ChatGPT Cheatsheet 2023
ChatGPT Cheatsheet 2023ChatGPT Cheatsheet 2023
ChatGPT Cheatsheet 2023SaahilThakur
 
Montreal Girl Geeks: Building the Modern Web
Montreal Girl Geeks: Building the Modern WebMontreal Girl Geeks: Building the Modern Web
Montreal Girl Geeks: Building the Modern WebRachel Andrew
 

What's hot (20)

Artificial Intelligence based Knowledge Management System - IBM Watson
Artificial Intelligence based Knowledge Management System - IBM WatsonArtificial Intelligence based Knowledge Management System - IBM Watson
Artificial Intelligence based Knowledge Management System - IBM Watson
 
ChatGPT ChatBot
ChatGPT ChatBotChatGPT ChatBot
ChatGPT ChatBot
 
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
Thomas Wolf "An Introduction to Transfer Learning and Hugging Face"
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Exploring ChatGPT For Effective Teaching
Exploring ChatGPT For Effective TeachingExploring ChatGPT For Effective Teaching
Exploring ChatGPT For Effective Teaching
 
OpenAI-Copilot-ChatGPT.pptx
OpenAI-Copilot-ChatGPT.pptxOpenAI-Copilot-ChatGPT.pptx
OpenAI-Copilot-ChatGPT.pptx
 
Training Week: Introduction to Neo4j
Training Week: Introduction to Neo4jTraining Week: Introduction to Neo4j
Training Week: Introduction to Neo4j
 
ChatGPT_Cheatsheet_Costa.pdf
ChatGPT_Cheatsheet_Costa.pdfChatGPT_Cheatsheet_Costa.pdf
ChatGPT_Cheatsheet_Costa.pdf
 
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
Behavior-Driven Development (BDD) Testing with Apache Spark with Aaron Colcor...
 
ChatGPT SEO Guide 2023
ChatGPT SEO Guide 2023ChatGPT SEO Guide 2023
ChatGPT SEO Guide 2023
 
Prompt Engineering
Prompt EngineeringPrompt Engineering
Prompt Engineering
 
Behind the Scenes of ChatGPT.pptx
Behind the Scenes of ChatGPT.pptxBehind the Scenes of ChatGPT.pptx
Behind the Scenes of ChatGPT.pptx
 
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
Open Source h2oGPT with Retrieval Augmented Generation (RAG), Web Search, and...
 
Unlocking the Power of ChatGPT
Unlocking the Power of ChatGPTUnlocking the Power of ChatGPT
Unlocking the Power of ChatGPT
 
ChatGPT Prompt Tips
ChatGPT Prompt TipsChatGPT Prompt Tips
ChatGPT Prompt Tips
 
Ai chatbot
Ai chatbotAi chatbot
Ai chatbot
 
Google
GoogleGoogle
Google
 
ChatGPT Cheatsheet 2023
ChatGPT Cheatsheet 2023ChatGPT Cheatsheet 2023
ChatGPT Cheatsheet 2023
 
Artificial intelligence, machine learning and internet of things
Artificial intelligence, machine learning and internet of thingsArtificial intelligence, machine learning and internet of things
Artificial intelligence, machine learning and internet of things
 
Montreal Girl Geeks: Building the Modern Web
Montreal Girl Geeks: Building the Modern WebMontreal Girl Geeks: Building the Modern Web
Montreal Girl Geeks: Building the Modern Web
 

Similar to Amundsen at Brex and Looker integration

Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...DataWorks Summit
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
HLoader – Automated Incremental Hadoop Data Loader Service and Framework
HLoader – Automated Incremental Hadoop Data Loader Service and FrameworkHLoader – Automated Incremental Hadoop Data Loader Service and Framework
HLoader – Automated Incremental Hadoop Data Loader Service and FrameworkDániel Stein
 
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET-  	  Hosting NLP based Chatbot on AWS Cloud using DockerIRJET-  	  Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET- Hosting NLP based Chatbot on AWS Cloud using DockerIRJET Journal
 
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a serviceCOMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a serviceAntonio García-Domínguez
 
SQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsSQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsMike Broberg
 
Data Engineer's Lunch #54: dbt and Spark
Data Engineer's Lunch #54: dbt and SparkData Engineer's Lunch #54: dbt and Spark
Data Engineer's Lunch #54: dbt and SparkAnant Corporation
 
Sumo Logic Cert Jam - Advanced Metrics with Kubernetes
Sumo Logic Cert Jam - Advanced Metrics with KubernetesSumo Logic Cert Jam - Advanced Metrics with Kubernetes
Sumo Logic Cert Jam - Advanced Metrics with KubernetesSumo Logic
 
OpenProdoc Overview
OpenProdoc OverviewOpenProdoc Overview
OpenProdoc Overviewjhierrot
 
Spring data jpa are used to develop spring applications
Spring data jpa are used to develop spring applicationsSpring data jpa are used to develop spring applications
Spring data jpa are used to develop spring applicationsmichaelaaron25322
 
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Ricard de la Vega
 
ESSIR LivingKnowledge DiversityEngine tutorial
ESSIR LivingKnowledge DiversityEngine tutorialESSIR LivingKnowledge DiversityEngine tutorial
ESSIR LivingKnowledge DiversityEngine tutorialJonathon Hare
 
Microsoft Azure News - Dec 2016
Microsoft Azure News - Dec 2016Microsoft Azure News - Dec 2016
Microsoft Azure News - Dec 2016Daniel Toomey
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamDataWorks Summit
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Michael Rys
 
ArcReady - Architecting For The Cloud
ArcReady - Architecting For The CloudArcReady - Architecting For The Cloud
ArcReady - Architecting For The CloudMicrosoft ArcReady
 

Similar to Amundsen at Brex and Looker integration (20)

Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
Angular meteor presentation
Angular meteor presentationAngular meteor presentation
Angular meteor presentation
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
HLoader – Automated Incremental Hadoop Data Loader Service and Framework
HLoader – Automated Incremental Hadoop Data Loader Service and FrameworkHLoader – Automated Incremental Hadoop Data Loader Service and Framework
HLoader – Automated Incremental Hadoop Data Loader Service and Framework
 
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET-  	  Hosting NLP based Chatbot on AWS Cloud using DockerIRJET-  	  Hosting NLP based Chatbot on AWS Cloud using Docker
IRJET- Hosting NLP based Chatbot on AWS Cloud using Docker
 
Linkers in compiler
Linkers in compilerLinkers in compiler
Linkers in compiler
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a serviceCOMMitMDE'18: Eclipse Hawk: model repository querying as a service
COMMitMDE'18: Eclipse Hawk: model repository querying as a service
 
SQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 QuestionsSQL to NoSQL: Top 6 Questions
SQL to NoSQL: Top 6 Questions
 
Data Engineer's Lunch #54: dbt and Spark
Data Engineer's Lunch #54: dbt and SparkData Engineer's Lunch #54: dbt and Spark
Data Engineer's Lunch #54: dbt and Spark
 
Sumo Logic Cert Jam - Advanced Metrics with Kubernetes
Sumo Logic Cert Jam - Advanced Metrics with KubernetesSumo Logic Cert Jam - Advanced Metrics with Kubernetes
Sumo Logic Cert Jam - Advanced Metrics with Kubernetes
 
OpenProdoc Overview
OpenProdoc OverviewOpenProdoc Overview
OpenProdoc Overview
 
Spring data jpa are used to develop spring applications
Spring data jpa are used to develop spring applicationsSpring data jpa are used to develop spring applications
Spring data jpa are used to develop spring applications
 
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
Technical Challenges and Approaches to Build an Open Ecosystem of Heterogeneo...
 
ESSIR LivingKnowledge DiversityEngine tutorial
ESSIR LivingKnowledge DiversityEngine tutorialESSIR LivingKnowledge DiversityEngine tutorial
ESSIR LivingKnowledge DiversityEngine tutorial
 
Microsoft Azure News - Dec 2016
Microsoft Azure News - Dec 2016Microsoft Azure News - Dec 2016
Microsoft Azure News - Dec 2016
 
Creating a SPA blog withAngular and Cloud Firestore
Creating a SPA blog withAngular and Cloud FirestoreCreating a SPA blog withAngular and Cloud Firestore
Creating a SPA blog withAngular and Cloud Firestore
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
Running cost effective big data workloads with Azure Synapse and ADLS (MS Ign...
 
ArcReady - Architecting For The Cloud
ArcReady - Architecting For The CloudArcReady - Architecting For The Cloud
ArcReady - Architecting For The Cloud
 

More from markgrover

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 markgrover
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsenmarkgrover
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy designmarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadatamarkgrover
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futuremarkgrover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beammarkgrover
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyftmarkgrover
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spotmarkgrover
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 

More from markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
The Lyft data platform: Now and in the future
The Lyft data platform: Now and in the futureThe Lyft data platform: Now and in the future
The Lyft data platform: Now and in the future
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 

Recently uploaded

AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...caitlingebhard1
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfdanishmna97
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governanceWSO2
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceIES VE
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 

Recently uploaded (20)

AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Amundsen at Brex and Looker integration

  • 1.
  • 3. Why Amundsen? (why data catalog at all?
  • 4. Documentation and discoverability 1. Data scientists and analysts found it difficult to learn what a table is or how and where its used 2. Documentation often doesn’t exist! Governance 3. We have no unified way to tag and identify sensitive data/PII The Problems
  • 5. Documentation and discoverability 1. Data scientists and analysts found it difficult to learn what a table is or how and where its used a. Single searchable source of documentation b. Data catalogs can display dependencies between tables/dashboards 2. Documentation often doesn’t exist! a. The catalog’s existence provides motivation to write documentation b. The catalog can be used to track down popular but undocumented tables Governance 3. We have no unified way to tag and identify sensitive data/PII a. Provides a unified tag database for tags from all integrated data sources b. Tags can be auto-generated by PII-detection tools, as well as manually added Data catalog to the rescue
  • 6. Amundsen is: ● 🙂 Simple - exactly the feature set we needed ● 🎉 Open source = ○ less expensive ○ highly customizable ○ can be integrated with anything ○ part of a community We chose Amundsen because ❤ open source!
  • 8. SSO with Okta Custom Looker integration K8s pods for each service, running behind a VPN Airflow ETL DAG runs once per day Neo4J 4.x
  • 9. We decided not to let people edit descriptions At least, for now) We wanted documentation to be version controlled, and we wanted to the code to be the single source of truth. Since people work in code a lot of the time, we felt that it is important for Amundsen to work with code documentation, not against it. We intend to revisit this though!
  • 11. Package Requirements Issues We fixed these! Amundsen Databuilder had a pretty strict requirements.txt, making it difficult to: 1. Integrate into our existing Airflow codebase. 2. Combine it with other libraries Eventually we unpinned most of the dependencies, and contributed this upstream!
  • 12. Neo4J 4.x We’d love to fix this upstream! We decided to use Neo4J 4.x, because: 1. Backups are easier 2. It would give us the opportunity to potentially switch to a SAAS e.g. neo4j Aura Amundsen does not support Neo4J 4.x, so we made a few tweaks! As a result, we’re currently using some forks.
  • 13. Fun stuff we did!
  • 14. Looker Integration .LookML files in Github Models and Views; SQL queries Dashboards via API Dashboards reference Models and Views Looker’s Sources of truth Amundsen ?Git clone?
  • 15. Looker Integration .LookML files in Github Models and Views; SQL queries Dashboards via API Dashboards reference Models and Views Looker’s Sources of truth Amundsen ?Git clone?
  • 16. Looker Integration Looker data lineage extraction code Docker Image for Artifact Builder (in ECR) Models/Views In LookML repo Amundsen Airflow DAG Combines lookml view data With API dashboard data Lineage.json (in S3) CICD uses docker image to build lineage.json Docker image built in CICD 1. Build docker images which convert LookerML files to JSON lineage data 2. Use the Docker images in the source repository’s build pipeline to create artifacts and upload them to S3 3. Consumers download the artifacts at run-time, combine with dashboard data from API Dashboards from Looker API
  • 17. Looker Integration - Stage 1 Model/View files (In looker Github repo) Lineage.json (in S3) Parse LookML files and build lineage graph during CI. Neo4J
  • 18. Looker Integration - Stage 2 Model/View files (In looker Github repo) Lineage.json (in S3) Amundsen Airflow Pipeline Combines lookml view data With API dashboard data Dashboards (From Looker API) Neo4J Parse LookML files and build lineage graph during CI. Query dashboards using Looker SDK Download lineage data from S3
  • 19. Looker Integration Data lineage - Analysts love this! Description from looker
  • 23. Customized Snowflake Integration One query can pull all the metadata we need from many tables across multiple databases: ● Columns ● Last update time ● Documentation ● Table usage
  • 25. Collect user feedback and usage statistics to drive adoption and documentation
  • 26. Use Amundsen for PII tagging
  • 27. Contribute upstream 🤞 ● Looker integration ● Neo4J 4.x support ● “Enriching Transformers” ● Useful taggers
  • 28. Thanks for your time! Any questions?