By Evgeny Shulman
CTO at databand.ai
● Building Data Pipelines since 2004
● Founding employee at Crosswise (acquired by Oracle Data Cloud)
○ Luigi, Airflow, all possible Spark deployments, taking ML
pipelining to the extreme.
● CTO at Databand.ai, father of two boys, enjoying both! :)
REAL DATA PIPELINES
○ Spark, TensorFlow, Python, SQL...
○ formats, schemas, versions
○ pipelines of pipelines, wiring..
Real data pipelines often fail, take forever to
rerun and provide no idea of what’s wrong!
Because of... Code Changes
● DAGs change a lot too!
● A side effect shows up somewhere downstream...
→ OBSERVABILITY NEEDED!
Because of... Data Changes
● Data Schema changes
● Data Quality Changes
● Data Corruptions
→ OBSERVABILITY NEEDED!
take 1 - Production Engineering Approach
Production Metrics: Grafana, Kibana
Production Logging: Loggly, Logz.io, others
Production Alerting: Datadog, Zabbix, Nagios, Grafana alerts
take 2 - Data Science Approach
Experimentation Management by an external system (MLflow, Sacred,
SageMaker, and many, many others)
Encapsulating reporting into Notebooks
take 3 - Data Ops Engineering Approach
Data management on its own
Job submission on its own
Validation Operators ( + external frameworks)
take N: Mix and Match
Understand your customer
● Metrics, metrics, metrics
○ Start from data inputs - 90% of your bugs are somewhere in data
○ Spark job/TensorFlow job is not a black box! (collect metrics)
● Build Comparison Methodology
○ Grafana dashboards
○ 1 to 1 compare!
○ Data pipelines are a huge Engineering investment
○ Don’t be afraid of having multiple systems
○ Implement your own! (know how and when to Fork)
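The "start from data inputs" and "1 to 1 compare" points above can be sketched in a few lines. This is a minimal illustration, not a recommended framework: the helper names `profile_rows` and `compare_runs`, the record fields, and the 10% tolerance are all assumptions made up for the example.

```python
from statistics import mean

def profile_rows(rows, numeric_field, required_fields):
    """Compute a few basic metrics for one batch of input records (dicts)."""
    values = [r[numeric_field] for r in rows if r.get(numeric_field) is not None]
    return {
        "row_count": len(rows),
        # Count every missing required field across all rows.
        "null_count": sum(
            1 for r in rows for f in required_fields if r.get(f) is None
        ),
        f"{numeric_field}_mean": mean(values) if values else 0.0,
    }

def compare_runs(baseline, candidate, tolerance=0.1):
    """1-to-1 compare: return metrics whose relative change exceeds tolerance."""
    drifted = {}
    for name, old in baseline.items():
        new = candidate.get(name)
        if new is None:
            drifted[name] = (old, None)
            continue
        denom = abs(old) if old else 1.0  # avoid dividing by zero
        if abs(new - old) / denom > tolerance:
            drifted[name] = (old, new)
    return drifted

# Hypothetical input batches from two consecutive runs.
yesterday = profile_rows(
    [{"user": "a", "amount": 10.0}, {"user": "b", "amount": 12.0}],
    numeric_field="amount", required_fields=["user", "amount"],
)
today = profile_rows(
    [{"user": "a", "amount": 10.0}, {"user": None, "amount": 12.0}],
    numeric_field="amount", required_fields=["user", "amount"],
)
print(compare_runs(yesterday, today))
```

The same comparison works between a production run and a development run, which is one way to read "1 to 1 compare".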
Airflow Operational vs Business Metrics
Connect Airflow StatsD, great for cluster monitoring!
Use Airflow Trends!
Not sure about inlets/outlets in BaseOperator
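Airflow's built-in StatsD output covers operational metrics; business metrics have to be emitted from the task code itself. A minimal sketch of doing that with only the standard library is shown here, using the plain StatsD line protocol (`name:value|type`) over UDP. The metric name, host, and helper names are assumptions for illustration; a real setup would more likely use an existing StatsD client library.

```python
import socket

def statsd_line(name, value, metric_type="g"):
    """Format one metric in the plain StatsD line protocol: name:value|type."""
    return f"{name}:{value}|{metric_type}"

def send_metric(name, value, metric_type="g", host="localhost", port=8125):
    """Fire-and-forget one metric over UDP (8125 is the usual StatsD port)."""
    payload = statsd_line(name, value, metric_type).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

if __name__ == "__main__":
    # Hypothetical business metric, sent after the task's real work is done.
    send_metric("pipeline.ingest.rows_written", 15230)
```

Because UDP is fire-and-forget, a missing or slow metrics daemon does not fail the task, which is usually the desired trade-off for instrumentation.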
Your KEY and x-axes
● Treat your system as a BATCH system, not as 24/7
○ Restarts will happen
○ Scheduling and SLA
● Scheduled time, execution time, restart time
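One way to read the batch-system point: plot metrics against the scheduled time of the run rather than the wall-clock time of the attempt, so that a restart replaces its earlier attempt instead of appearing as an extra point. The sketch below assumes a hypothetical `RunPoint` record; the field names are illustrative, not taken from Airflow.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class RunPoint:
    scheduled_for: datetime  # logical batch interval the run belongs to
    started_at: datetime     # wall-clock time this attempt actually started
    attempt: int             # 1 for the first run, 2+ for restarts
    value: float             # the metric being tracked

def latest_per_schedule(points):
    """Keep one point per scheduled time: the one from the highest attempt."""
    best = {}
    for p in points:
        current = best.get(p.scheduled_for)
        if current is None or p.attempt > current.attempt:
            best[p.scheduled_for] = p
    return [best[k] for k in sorted(best)]

day1 = datetime(2020, 1, 1, tzinfo=timezone.utc)
day2 = datetime(2020, 1, 2, tzinfo=timezone.utc)
points = [
    RunPoint(day1, datetime(2020, 1, 2, 1, 0, tzinfo=timezone.utc), 1, 120.0),
    # Restart of the day-1 batch, executed two days later.
    RunPoint(day1, datetime(2020, 1, 3, 9, 30, tzinfo=timezone.utc), 2, 980.0),
    RunPoint(day2, datetime(2020, 1, 3, 1, 0, tzinfo=timezone.utc), 1, 1010.0),
]
series = latest_per_schedule(points)
print([(p.scheduled_for.date().isoformat(), p.value) for p in series])
```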
Your KEY name
Is it just a NAME?
Or more like ENV.PROJECT.PIPELINE.CLIENT.TASK_ID ?
● Metrics2.0 to the Rescue! (use labels)
● You’ll have similar tasks in the pipeline!
● You’ll run the same tasks in development!
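The difference between a flat dotted key and a labeled metric can be illustrated as follows. The label names mirror the ENV.PROJECT.PIPELINE.CLIENT.TASK_ID example above; the label values, the helper names, and the Prometheus-like text rendering are assumptions chosen for the sketch.

```python
def flat_key(labels, order=("env", "project", "pipeline", "client", "task_id")):
    """Join label values into one dotted key, e.g. prod.churn.daily.acme.train."""
    return ".".join(str(labels[name]) for name in order)

def labeled_metric(name, value, labels):
    """Render a metric with explicit labels in a Prometheus-like text form."""
    rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{rendered}}} {value}"

# Hypothetical label values.
labels = {
    "env": "prod",
    "project": "churn",
    "pipeline": "daily_scoring",
    "client": "acme",
    "task_id": "train_model",
}
print(flat_key(labels))
print(labeled_metric("rows_processed", 1200, labels))
```

With labels, the same `rows_processed` metric from a development run differs only in `env`, so dashboards can filter or group by any dimension without parsing key positions.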
● Data processes are no different from your customer-facing front end
○ You MUST monitor and alert!
● Pagerduty! Slack channel!
● (read good practices on alerting)
● Your jobs have Discrete metrics; your Alerting system will not like that!
● Ask your team to develop stable KPIs.
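One possible shape for a "stable KPI" is a ratio between the latest run's value and a baseline built from previous runs, so the alert rule compares against a band instead of a raw, once-per-run number. The sketch below is only one such approach; the function names, the median baseline, and the 0.5 to 2.0 band are assumptions for illustration.

```python
from statistics import median

def stable_kpi(history, latest):
    """Ratio of the latest run's value to the median of previous runs."""
    if not history:
        return 1.0  # no baseline yet: treat the first run as normal
    baseline = median(history)
    if baseline == 0:
        return 1.0 if latest == 0 else float("inf")
    return latest / baseline

def should_alert(history, latest, low=0.5, high=2.0):
    """Alert when the ratio leaves the [low, high] band."""
    ratio = stable_kpi(history, latest)
    return not (low <= ratio <= high)

# Hypothetical row counts from the last four daily runs.
history = [1000, 1040, 980, 1010]
print(should_alert(history, 1020))  # close to the usual volume
print(should_alert(history, 450))   # less than half the usual volume
```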
What we didn’t discuss
Advanced Alerting on Data Pipelines
How to reuse Production observability in Development and vice versa
What we did discuss
Start from basics: instrument your code!