REA Group's journey with Data Cataloging and Amundsen

REA Group's Journey with Data Cataloging
2020.11.05

How do you pronounce Amundsen?
• American way != Australian way != Norwegian way

Agenda
• Why we needed a data catalog and why we chose Amundsen
• An overview of our implementation
• User feedback and customisations
• What's next on our roadmap
Alex Kompos
Data Developer
Abhinay Kathuria
Data Developer
Stacy Sterling
Data Manager

Why we needed a data catalog
• REA Group is Australia's largest property advertising portal
• 1,400 employees
• ~500 developers
• ~50 analysts & data scientists

Why we chose Amundsen
Pros
• Most of our "must have" features were already available (integration with BigQuery and Airflow)
• Flexibility to customise and build features we needed
• Doesn't rely on manual curation which can become outdated quickly
• Allows users to search for data they don't already have access to
• Clean, intuitive UI
• Opportunity for our team to contribute back to an open-source project
Considerations
• Lacked features that the vendor solutions offered (business metrics glossary, column-level lineage)
• Our team did not have much front-end development experience
• We didn't know how long implementation might take

How did we implement
• Implemented a POC last year as a Hackathon project
• Wanted to Productionize an MVP
• Get alpha user feedback
• Release to the wider community

Deployment Stack
• AWS ECS for each service
• Neo4j Backend running on EC2
• AWS Managed Elasticsearch
• EFS Storage for Neo4j

Metadata Extraction
• Using Breeze (Internal ETL as a service tool)
• Running a DAG daily
• Scrape data from Google BigQuery
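A minimal sketch of what a daily extraction step might look like, assuming the column metadata is read from BigQuery's `INFORMATION_SCHEMA.COLUMNS` view. The slides do not say how Breeze or the Amundsen extractor actually queries BigQuery, so the function names `columns_query` and `group_columns` are hypothetical, and running the query is left to whichever client the job uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def columns_query(project: str, dataset: str) -> str:
    """Build a query listing every column in one dataset (assumed approach)."""
    return (
        "SELECT table_name, column_name, data_type "
        f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
        "ORDER BY table_name, ordinal_position"
    )


def group_columns(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[str, List[Dict[str, str]]]:
    """Group (table, column, type) rows into one column list per table."""
    tables: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for table_name, column_name, data_type in rows:
        tables[table_name].append({"name": column_name, "type": data_type})
    return dict(tables)
```

A scheduled job could run the query once per dataset and pass the resulting rows to `group_columns` before loading them into the catalog.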

What customisations did we make?
• Amundsen is built to be company agnostic
• Each company has a different data culture, data maturity level and domains.
• Over 12 changes to Amundsen
• Based on feedback from alpha users
• Changes that relate to a broader audience will be upstreamed

How did we implement the changes?
• Customisations are done by building a custom Docker image
• Any changes to source files are then patched when building the image
• We mirror the folder structure on mainline
• Patching is “cheap”
• Will be annoying to deal with version upgrades with large refactors
• Forking might be easier in the future
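The patching approach above can be sketched as an overlay copy: customised files live in a folder that mirrors the upstream layout and are copied over the upstream sources during the image build. This is an illustration only. The function name `apply_overlay` and the directory arguments are hypothetical, and the slides do not show REA's actual build script.

```python
import shutil
from pathlib import Path
from typing import List


def apply_overlay(patch_root: Path, upstream_root: Path) -> List[Path]:
    """Copy every file under patch_root onto the same relative path in upstream_root.

    Returns the relative paths that were written, in sorted order.
    """
    written: List[Path] = []
    for src in sorted(patch_root.rglob("*")):
        if not src.is_file():
            continue  # directories are created on demand below
        rel = src.relative_to(patch_root)
        dest = upstream_root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # overwrite the upstream file with the patched one
        written.append(rel)
    return written
```

Because only whole files are replaced, an upstream refactor that moves or rewrites a patched file leaves the overlay silently out of date, which is the upgrade pain the slide anticipates.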

Separating service accounts & frequent users
• Our users look to Frequent Users to find domain experts; however, the list was
polluted by our service accounts, which don’t provide much context
• E.g. vaultxxxxx-xxxxxx--xxxxxx@xxx-xxx-xxxx.iam.gserviceaccount.com
• This was achieved by filtering out users with “gserviceaccount”
• Unsure if this feature would be useful to the broader community
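A minimal sketch of the filter described above, assuming the frequent-user list is available as plain email strings. The function name `filter_frequent_users` and the sample addresses are hypothetical, and where this check sits in REA's Amundsen deployment is not stated in the slides.

```python
from typing import Iterable, List

# Substring that identifies Google Cloud service accounts in an email address.
SERVICE_ACCOUNT_MARKER = "gserviceaccount"


def filter_frequent_users(emails: Iterable[str]) -> List[str]:
    """Keep only addresses that do not look like service accounts."""
    return [e for e in emails if SERVICE_ACCOUNT_MARKER not in e.lower()]


users = [
    "alice@example.com",
    "etl-runner@my-project.iam.gserviceaccount.com",
    "bob@example.com",
]
human_users = filter_frequent_users(users)
```

A substring match is simple, but it would also drop any human user whose address happens to contain the marker, so a stricter suffix check may be preferable in other environments.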

Advanced search
Amundsen 2.3.0 vs. REA version
• Tooltips that resonated with our users
• Used “BigQuery” Language
• Removed non-applicable filters
• Done through the frontend config

Partition Columns
Amundsen 2.3.0 vs. REA version
• Confusion with partition ranges came up.
• Used “BigQuery” Language
• Defaults to “Non-Partitioned Table”
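The default described above might be expressed as a small display helper. This is a sketch under assumptions: the function name `partition_label` and the "Partitioned by" wording are hypothetical, and only the "Non-Partitioned Table" fallback comes from the slide.

```python
from typing import Optional


def partition_label(partition_column: Optional[str]) -> str:
    """Return the text shown for a table's partitioning."""
    if partition_column:
        # Wording for partitioned tables is an assumption, not REA's actual UI text.
        return f"Partitioned by {partition_column}"
    return "Non-Partitioned Table"
```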

What's next on the menu for Amundsen at REA?
Coming up next
• Authentication & authorization (RBAC)
• Preview feature, bookmarks
• Surface Breeze metadata
• Breeze is our ETL-as-a-service tool: a YAML-based abstraction layer over Airflow for job orchestration
• Data lineage umbrella
• Input/Output tables, transformation logic, schedules
• Ties into our broader metadata strategy
• Metadata stored in either a BigQuery table or Kafka

Also in our backlog (not high priority)
• Enforcing table & field descriptions through Breeze
• Adding programmatic descriptions
• Improving the way search results are displayed
• Table-level lineage
• Implementing a tagging strategy
• Integration with a business metrics glossary
• Integration with Tableau Server
• Integration with Kafka topics
