REA Group's Journey with Data Cataloging
2020.11.05
How do you pronounce Amundsen?
• American way != Australian way != Norwegian way
Agenda
• Why we needed a data catalog and why we chose Amundsen
• An overview of our implementation
• User feedback and customisations
• What's next on our roadmap
Alex Kompos
Data Developer
Abhinay Kathuria
Data Developer
Stacy Sterling
Data Manager
Why we needed a data catalog
• REA Group is Australia's largest property advertising portal
• 1,400 employees
• ~500 developers
• ~50 analysts & data scientists
Why we needed a data catalog
Our stack
Why we chose Amundsen
Pros
• Most of our "must have" features were already available (integration with BigQuery and Airflow)
• Flexibility to customise and build features we needed
• Doesn't rely on manual curation which can become outdated quickly
• Allows users to search for data they don't already have access to
• Clean, intuitive UI
• Opportunity for our team to contribute back to an open-source project
Considerations
• Lacked features that the vendor solutions offered (business metrics glossary, column-level lineage)
• Our team did not have much front-end development experience
• We didn't know how long implementation might take
How did we implement it?
• Implemented a POC last year as a hackathon project
• Wanted to Productionize an MVP
• Get alpha user feedback
• Release to the wider community
Deployment Stack
• AWS ECS for each service
• Neo4j Backend running on EC2
• AWS Managed Elasticsearch
• EFS Storage for Neo4j
Metadata Extraction
• Using Breeze (Internal ETL as a service tool)
• Running a DAG daily
• Scrape data from Google BigQuery
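The daily extraction step can be pictured roughly as follows. This is a minimal sketch, not REA's Breeze job: it assumes the table metadata has already been fetched from BigQuery as plain dictionaries, and the field names (`project`, `dataset`, `table`, `columns`) and the `flatten_table_metadata` helper are hypothetical.

```python
from typing import Dict, Iterator, List


def flatten_table_metadata(tables: List[dict]) -> Iterator[Dict[str, str]]:
    """Yield one catalog record per column of each table.

    The input shape is an assumption made for illustration, not the
    BigQuery API response format.
    """
    for table in tables:
        # Fully qualified name in the usual project.dataset.table form.
        table_key = f"{table['project']}.{table['dataset']}.{table['table']}"
        for position, column in enumerate(table.get("columns", [])):
            yield {
                "table": table_key,
                "column": column["name"],
                "type": column["type"],
                # Missing descriptions become empty strings, not None.
                "description": column.get("description") or "",
                "sort_order": str(position),
            }


# Hypothetical sample of what one daily run might have scraped.
sample = [
    {
        "project": "example-project",
        "dataset": "listings",
        "table": "daily_views",
        "columns": [
            {"name": "listing_id", "type": "STRING", "description": "Listing key"},
            {"name": "views", "type": "INT64"},
        ],
    }
]

records = list(flatten_table_metadata(sample))
for record in records:
    print(record["table"], record["column"], record["type"])
```

Emitting one flat record per column keeps the loading step simple: each record can be written to the graph or search index without knowing the nested shape of the source.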
What customisations did we make?
• Amundsen is built to be company agnostic
• Each company has a different data culture, data maturity level and domains.
• Over 12 changes to Amundsen
• Based on feedback from alpha users
• Changes that relate to a broader audience will be upstreamed
How did we implement the changes?
• Customisations are done by building a custom Docker image
• Any changes to source files are then patched when building the image
• We mirror the folder structure on mainline
• Patching is “cheap”
• Version upgrades that include large refactors will be annoying to deal with
• Forking might be easier in the future
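One way to picture the mirrored-folder approach is as a file overlay applied at image build time. The sketch below is an assumption about what "patching" means here (overwriting upstream files with customised copies at the same relative path); the `apply_overlay` helper and all file names are hypothetical, and the real build may use diffs or Dockerfile `COPY` steps instead.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def apply_overlay(overlay_dir: Path, upstream_dir: Path) -> List[str]:
    """Copy every file under overlay_dir onto the same relative path in upstream_dir.

    Returns the relative paths that were written, sorted for stable output.
    """
    written = []
    for source in overlay_dir.rglob("*"):
        if not source.is_file():
            continue
        relative = source.relative_to(overlay_dir)
        target = upstream_dir / relative
        # Create parent folders so brand-new files can be added too.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
        written.append(relative.as_posix())
    return sorted(written)


# Demonstration with throwaway directories; all file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    upstream = Path(tmp) / "upstream"
    overlay = Path(tmp) / "overlay"
    (upstream / "static" / "js").mkdir(parents=True)
    (overlay / "static" / "js").mkdir(parents=True)
    (upstream / "static" / "js" / "config.txt").write_text("upstream default")
    (upstream / "README.txt").write_text("untouched")
    (overlay / "static" / "js" / "config.txt").write_text("custom value")

    patched = apply_overlay(overlay, upstream)
    patched_content = (upstream / "static" / "js" / "config.txt").read_text()
    untouched_content = (upstream / "README.txt").read_text()

print(patched)
```

Because the overlay replaces whole files, an upstream refactor that moves or rewrites a patched file silently invalidates the customisation, which is the upgrade pain the slide anticipates.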
Summary of frontend changes
Separating service accounts & frequent users
• Our users look to Frequent Users to find domain experts; however, it was polluted by our service accounts, which don’t provide much context
• E.g. vaultxxxxx-xxxxxx--xxxxxx@xxx-xxx-xxxx.iam.gserviceaccount.com
• This was achieved by filtering out users whose address contains “gserviceaccount”
• Unsure if this feature would be useful to the broader community
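The filtering rule itself is simple enough to sketch. The snippet below shows only the logic, in Python for illustration; it is not the actual frontend change, and the `split_frequent_users` helper and the sample addresses are hypothetical.

```python
from typing import List, Tuple

# Substring that marks a Google Cloud service account address.
SERVICE_ACCOUNT_MARKER = "gserviceaccount"


def split_frequent_users(emails: List[str]) -> Tuple[List[str], List[str]]:
    """Separate human users from service accounts, preserving input order."""
    people: List[str] = []
    service_accounts: List[str] = []
    for email in emails:
        if SERVICE_ACCOUNT_MARKER in email.lower():
            service_accounts.append(email)
        else:
            people.append(email)
    return people, service_accounts


# Made-up addresses standing in for a table's frequent users.
frequent_users = [
    "analyst@example.com",
    "etl-runner@example-project.iam.gserviceaccount.com",
    "developer@example.com",
]

people, service_accounts = split_frequent_users(frequent_users)
print(people)
print(service_accounts)
```

Returning both lists, instead of discarding the service accounts, leaves room to show them in a separate section.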
Advanced search
Amundsen 2.3.0 vs. REA version
• Tool tips that resonated with our users
• Used “BigQuery” Language
• Removed non-applicable filters
• Done through the frontend config
Partition Columns
Amundsen 2.3.0 vs. REA version
• Confusion with partition ranges came up.
• Used “BigQuery” Language
• Defaults to “Non-Partitioned Table”
What's next on the menu for Amundsen at REA?
Coming up next
• Authentication & authorization (RBAC)
• Preview feature, bookmarks
• Surface Breeze metadata
• Breeze is our ETL-as-a-service tool: a YAML-based abstraction layer over Airflow for job orchestration
• Data lineage umbrella
• Input/Output tables, transformation logic, schedules
• Ties into our broader metadata strategy
• Metadata stored in either a BigQuery table or Kafka
Also in our backlog (not high priority)
• Enforcing table & field descriptions through Breeze
• Adding programmatic descriptions
• Improving the way search results are displayed
• Table-level lineage
• Implementing a tagging strategy
• Integration with a business metrics glossary
• Integration with Tableau Server
• Integration with Kafka topics
Questions?

 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 

REA Group's journey with Data Cataloging and Amundsen

  • 1. REA Group's Journey with Data Cataloging 2020.11.05
  • 2. How do you pronounce Amundsen? • American way != Australian way != Norwegian way
  • 3. Agenda • Why we needed a data catalog and why we chose Amundsen • An overview of our implementation • User feedback and customisations • What's next on our roadmap • Presenters: Alex Kompos (Data Developer), Abhinay Kathuria (Data Developer), Stacy Sterling (Data Manager)
  • 4. Why we needed a data catalog • REA Group is Australia's largest property advertising portal • 1,400 employees • ~500 developers • ~50 analysts & data scientists
  • 5. Why we needed a data catalog
  • 7. Why we chose Amundsen. Pros: • Most of our "must have" features were already available (integration with BigQuery and Airflow) • Flexibility to customise and build features we needed • Doesn't rely on manual curation, which can become outdated quickly • Allows users to search for data they don't already have access to • Clean, intuitive UI • Opportunity for our team to contribute back to an open-source project. Considerations: • Lacked features that the vendor solutions offered (business metrics glossary, column-level lineage) • Our team did not have much front-end development experience • We didn't know how long implementation might take
  • 8. How did we implement • Implemented a POC last year as a Hackathon Project • Wanted to Productionize an MVP • Get alpha user feedback • Release to the wider community
  • 9. Deployment Stack • AWS ECS for each service • Neo4j Backend running on EC2 • AWS Managed Elasticsearch • EFS Storage for Neo4j
  • 10. Metadata Extraction • Using Breeze (Internal ETL as a service tool) • Running a DAG daily • Scrape data from Google BigQuery
  • 11. What customisations did we make? • Amundsen is built to be company agnostic • Each company has a different data culture, data maturity level and domains • Over 12 changes to Amundsen • Based on feedback from alpha users • Changes that are relevant to a broader audience will be upstreamed
  • 12. How did we implement the changes? • Customisations are done by building a custom Docker image • Any changes to source files are then patched in when building the image • We mirror the folder structure on mainline • Patching is "cheap" • Version upgrades with large refactors will be annoying to deal with • Forking might be easier in the future
  • 14. Separating service accounts & frequent users • Our users look to Frequent Users to find domain experts; however, the list was polluted by our service accounts, which don’t provide much context • E.g. vaultxxxxx-xxxxxx--xxxxxx@xxx-xxx-xxxx.iam.gserviceaccount.com • This was achieved by filtering out users with “gserviceaccount” • Unsure if this feature would be useful to the broader community
  • 15. Advanced search (Amundsen 2.3.0 vs. REA version) • Tooltips that resonated with our users • Used “BigQuery” language • Removed non-applicable filters • Done through the frontend config
  • 16. Partition columns (Amundsen 2.3.0 vs. REA version) • Confusion with partition ranges came up • Used “BigQuery” language • Defaults to “Non-Partitioned Table”
  • 17. What's next on the menu for Amundsen at REA? Coming up next • Authentication & authorization (RBAC) • Preview feature, bookmarks • Surface Breeze metadata (Breeze is our ETL-as-a-service tool: an Airflow-based ETL for job orchestration with a YAML-based abstraction layer) • Data lineage umbrella • Input/output tables, transformation logic, schedules • Ties into our broader metadata strategy • Metadata stored in either a BigQuery table or Kafka
  • 18. Also in our backlog (not high priority) • Enforcing table & field descriptions through Breeze • Adding programmatic descriptions • Improving the way search results are displayed • Table-level lineage • Implementing a tagging strategy • Integration with a business metrics glossary • Integration with Tableau Server • Integration with Kafka topics
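
Slide 14 describes hiding service accounts from Frequent Users by filtering out any user containing "gserviceaccount". A minimal sketch of that kind of filter follows, assuming the frequent users arrive as a plain list of email strings. The function name, the constant and the sample addresses are hypothetical illustrations, not taken from Amundsen or from REA's patch.

```python
from typing import Iterable, List

# Substring named on slide 14; Google service account emails end in
# ".iam.gserviceaccount.com", so a substring match is enough here.
SERVICE_ACCOUNT_MARKER = "gserviceaccount"


def filter_frequent_users(
    emails: Iterable[str], marker: str = SERVICE_ACCOUNT_MARKER
) -> List[str]:
    """Return the emails that do not contain `marker`, preserving order.

    The comparison is case-insensitive so that oddly cased addresses
    are still treated as service accounts.
    """
    marker = marker.lower()
    return [email for email in emails if marker not in email.lower()]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    readers = [
        "analyst.one@example.com",
        "etl-runner@my-project.iam.gserviceaccount.com",
        "data.dev@example.com",
    ]
    for email in filter_frequent_users(readers):
        print(email)
```

Filtering on a substring keeps the change small, which suits a patch-based customisation; a stricter variant could match only the domain part of the address.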