1. Hari Karthikeyan
Data Engineering Intern
Fall Intern Presentation
December 15th, 2021
Manager: Ronak Shah, Director, Data Engineering
Mentor: Alok Shenoy, Senior Data Engineer
3. About Me
- Computer Engineering @ University of Waterloo
- Graduating April 2022
- Huge soccer and basketball fan!
Introduction Starter Projects Argos Project MAT 16 Project Acknowledgements
4. Starter Projects
Implemented a MEGA Export job in Airflow
● Created an Airflow operator and a test DAG to move data between two Redshift instances
Expanded the LaForge module to send out alerts via PagerDuty
● Integrated the PD plugin into LaForge; documented, tested, and deployed the changes to ensure alerting when DAGs don’t meet their SLAs
How it started…
5. Argos Project Overview
Goal: a data quality and anomaly detection system that will be central to implementing the "trust but verify" principle across all data pipelines at Coursera
6. Process
Planning Development Testing Demo Deployment
● Technical Design Document
● Storage and persistence layer
● Architectural design sessions
● Stage-check-exchange principle
● Data Quality check persistence operator
● Anomaly Detection operator
● Predictions Databricks notebook
● Argos plugin
● Airflow DAGs and data backfilling
● DAGs to perform end-to-end testing
● Backfilling of test data in EDW/EDS
● Impressions demo pipeline with dbt
● Integration with dbt to handle anomalies
● Circuit-breaker functionality of Argos
● Productionizing Argos
● Databricks jobs to move Argos metadata from raw layer -> L0 -> L1
● Extensive documentation - Operator development guideline
7. Argos
Motivation
● Building more pipelines on top of the core data sets puts data quality at risk
● Moving away from MEGA means we need a system that relies on Airflow to carry out blocking and non-blocking checks
● Need for an extensible framework to act as a circuit-breaker in all data pipelines
● Ability to run data quality checks and anomaly detection on EDW/EDS tables
8. Argos
Implementation
● Plugin to parse a config file and inject tasks into a DAG with dependencies
● Operator to persist check results as JSON in the raw-layer S3 bucket
● Operator to perform anomaly detection by comparing today’s check result to the latest prediction result in the L1 layer
● Databricks notebook to generate lower and upper prediction bounds based on historical check results data
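A minimal sketch of the first bullet, assuming a JSON config file: each configured check becomes a persist task followed by a detect task, placed between the stage and exchange steps of the stage-check-exchange pattern. The config schema, table names, task ids, and the `Task` and `build_tasks` helpers are hypothetical, and plain Python objects stand in for Airflow operators; the actual Argos plugin may look quite different.

```python
import json
from dataclasses import dataclass, field

# Hypothetical config; the real Argos config schema is not shown in the slides.
CONFIG = """
{
  "checks": [
    {"table": "edw.enrollments", "check": "row_count", "blocking": true},
    {"table": "eds.impressions", "check": "row_count", "blocking": false}
  ]
}
"""


@dataclass
class Task:
    """Stand-in for an Airflow task: an id plus the ids it depends on."""
    task_id: str
    upstream: list = field(default_factory=list)


def build_tasks(config_text, stage_id="stage", exchange_id="exchange"):
    """Expand each configured check into persist -> detect tasks after `stage`.

    Only blocking checks are added as upstream dependencies of `exchange`,
    so a non-blocking check still runs but does not gate the exchange step.
    """
    config = json.loads(config_text)
    tasks = [Task(stage_id)]
    exchange_upstream = [stage_id]
    for entry in config["checks"]:
        table = entry["table"].replace(".", "_")
        name = f"{entry['check']}__{table}"
        persist = Task(f"persist__{name}", upstream=[stage_id])
        detect = Task(f"detect__{name}", upstream=[persist.task_id])
        tasks.extend([persist, detect])
        if entry["blocking"]:
            exchange_upstream.append(detect.task_id)
    tasks.append(Task(exchange_id, upstream=exchange_upstream))
    return tasks


if __name__ == "__main__":
    for task in build_tasks(CONFIG):
        print(task.task_id, "<-", task.upstream)
```

In a real DAG the same loop would presumably create operator instances and wire them with Airflow's dependency syntax instead of building `Task` records.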
9. Argos
Challenges
● Finalizing technical design and project planning with so many moving parts
● Issues with local Airflow environment setup; used the dev Airflow cluster for testing
● Connecting Airflow to Databricks in order to trigger a notebook run
● Plenty of edge cases to consider; performed iterative development
10. Argos check
results metadata
Example
This table (l1.argos_check_results) has the
row count check results for all tables run
from various DAGs
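As a rough illustration of what one persisted check result might look like before it lands in this table, the sketch below serializes a single row count result to a JSON line. Every field name and value here is an assumption made for the example; the slides do not show the real schema, and the S3 write itself is left out.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class CheckResult:
    # All field names are hypothetical; the actual Argos schema is not shown.
    dag_id: str
    table_name: str
    check_name: str
    check_date: str
    value: int


def to_json_line(result):
    """Serialize one check result as a single JSON line with stable key order."""
    return json.dumps(asdict(result), sort_keys=True)


# Made-up sample record, for illustration only.
record = CheckResult(
    dag_id="demo_dag",
    table_name="edw.enrollments",
    check_name="row_count",
    check_date=date(2021, 12, 15).isoformat(),
    value=1200,
)
line = to_json_line(record)

if __name__ == "__main__":
    print(line)
```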
11. Argos prediction
results metadata
Example
This table (l1.argos_prediction_results) has
the row count prediction results for all tables
run from various DAGs
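One simple way to produce lower and upper bounds like these, and to compare a new check result against them, is sketched below. The slides do not say how the Databricks notebook computes its predictions, so the mean plus or minus `k` standard deviations rule is an assumed stand-in, and `prediction_bounds`, `is_anomaly`, `should_halt`, and the sample history are all hypothetical.

```python
from statistics import mean, stdev


def prediction_bounds(history, k=3.0):
    """Return (lower, upper) as mean -/+ k sample standard deviations.

    This is an assumed rule for illustration, not the notebook's method.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    center = mean(history)
    spread = stdev(history)
    return center - k * spread, center + k * spread


def is_anomaly(value, lower, upper):
    """A value is anomalous when it falls outside the inclusive bounds."""
    return not (lower <= value <= upper)


def should_halt(value, lower, upper, blocking):
    """Circuit-breaker decision: only a blocking check stops the pipeline."""
    return blocking and is_anomaly(value, lower, upper)


# Made-up historical row counts, for illustration only.
history = [1000, 1020, 980, 1010, 990]
lower, upper = prediction_bounds(history)

if __name__ == "__main__":
    for today in (1005, 400):
        print(today, is_anomaly(today, lower, upper))
```

Keeping the bounds computation separate from the comparison mirrors the split described in the deck, where predictions are generated in a notebook and the comparison happens in an Airflow operator.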
12. Argos Project
Next Steps
● UI to add checks on tables and a Looker dashboard to visualize them
● Argos as a central microservice (Flask application)
● Consolidation of Argos logs by integrating with project Helios
● Sophisticated predictions and anomaly detection
13. Key Learnings and Takeaways
● Value of extensive documentation and testing; monitoring after deployment of ETL pipelines
● Importance of architectural design sessions; teamwork and collaboration
● Python, SQL, Airflow, Databricks, AWS, dbt, Docker, Terraform
14. MAT 16 Project: SkillsMatch
● Full-service destination hub for career discovery, skills development, and job matching
● Mapping learner skills and skill proficiency levels to job openings and addressing skill gaps through a learner portal
● Filtering Coursera learners based on skills and targeting specific learner segments to fill job openings with qualified candidates through an employer portal
● Extracting real-world job data (skills, description, URL, salary, etc.) from Burning Glass APIs
● Backend data model to facilitate the matching algorithm and job/course recommendations based on users’ skill scores
15. Thank you!
● Mentor: Alok
● Manager: Ronak
● Demo: Chibu
● MAT: Steven, Simon (DE), entire SkillsMatch team
● Entire DE team
● Brown Bag Series