Data Pipeline Observability
By Evgeny Shulman
CTO at databand.ai
About Me
● Building Data Pipelines since 2004
● Founding employee at Crosswise (acquired by Oracle Data Cloud)
○ Luigi, Airflow, all possible Spark deployments, taking ML pipelining to the extreme.
● CTO at Databand.ai, father of two boys, enjoying both! :)
THE DREAM!
REAL DATA PIPELINES
● TOOLS
○ Spark, TensorFlow, Python, SQL…
● DATA
○ formats, schemas, versions
○ sources
● COMPLEXITY
○ pipelines of pipelines, wiring..
Real data pipelines often fail, take forever to rerun, and give you no idea of what went wrong!
Because of... Code Changes
● DAGs change a lot too!
● CI/CD?!
● A side effect lurks somewhere downstream…
→ OBSERVABILITY NEEDED!
Because of... Data Changes
● Data Schema changes
● Data Quality Changes
● Data Corruptions
→ OBSERVABILITY NEEDED!
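For example, a minimal ingestion check (a sketch only; pandas assumed, and the expected columns are hypothetical) can surface schema and quality changes before they propagate downstream:

    import pandas as pd

    # Hypothetical expected schema; adjust to your own dataset.
    EXPECTED_SCHEMA = {"user_id": "int64", "event_time": "datetime64[ns]", "amount": "float64"}

    def check_ingested_data(df: pd.DataFrame) -> None:
        """Fail fast on schema drift and log basic quality signals."""
        actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
        missing = set(EXPECTED_SCHEMA) - set(actual)
        drifted = {c: (EXPECTED_SCHEMA[c], actual[c])
                   for c in EXPECTED_SCHEMA if c in actual and actual[c] != EXPECTED_SCHEMA[c]}
        if missing or drifted:
            raise ValueError("Schema changed: missing={}, drifted={}".format(missing, drifted))
        # Simple quality signals: row count and worst per-column null ratio.
        print("rows={}, max_null_ratio={:.2%}".format(len(df), df.isna().mean().max()))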
Because… Everything Changes!
● Python dependency resolution: Pandas 1.0 :)
● Clusters
● Internal Libraries
→ OBSERVABILITY NEEDED!
REALITY
What we love
● Flexibility
● Scheduling
● Community
● Maturity
Airflow in ML/Data? (What was it built for?)
Does it solve the PROBLEM?
(Yes! And No..)
What’s the Solution?
take 1 - Production Engineering Approach
● Production metrics: Grafana, Kibana
● Production logging: Loggly, Logz.io, others
● Production alerting: Datadog, Zabbix, Nagios, Grafana alerts
take 2 - Data Science Approach
● Experimentation management by an external system (MLflow, Sacred, SageMaker, and many, many others)
● Encapsulating reporting into notebooks
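A minimal MLflow tracking sketch of this take; the tracking URI, experiment name, and metric names below are illustrative assumptions, not prescriptions:

    import mlflow

    mlflow.set_tracking_uri("http://localhost:5000")    # assumed local tracking server
    mlflow.set_experiment("daily_feature_pipeline")      # hypothetical experiment name

    with mlflow.start_run(run_name="2020-02-01"):
        mlflow.log_param("input_path", "s3://bucket/raw/2020-02-01/")
        mlflow.log_metric("rows_in", 1250000)
        mlflow.log_metric("rows_out", 1187345)
        mlflow.log_metric("null_ratio", 0.013)

The same pattern applies with Sacred or SageMaker; the point is that run metadata lands in a system built for comparing runs.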
take 3 - Data Ops Engineering Approach
● In-house development
● Custom operators
● Data management on its own
● Job submission on its own
● Validation operators (+ external frameworks)
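A sketch of the custom/validation operator idea, using Airflow 1.10-style imports; the table name, connection id, and threshold are assumptions for illustration, not Databand's actual code:

    from airflow.models import BaseOperator
    from airflow.hooks.postgres_hook import PostgresHook

    class RowCountValidationOperator(BaseOperator):
        """Fail the task when the target table looks suspiciously small."""

        def __init__(self, table, min_rows=1, postgres_conn_id="my_dwh", *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.table = table
            self.min_rows = min_rows
            self.postgres_conn_id = postgres_conn_id

        def execute(self, context):
            hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
            rows = hook.get_first("SELECT COUNT(*) FROM {}".format(self.table))[0]
            if rows < self.min_rows:
                raise ValueError("{}: {} rows, expected at least {}".format(
                    self.table, rows, self.min_rows))
            self.log.info("%s passed validation with %s rows", self.table, rows)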
take N: Mix and Match
Understand your customer
Measure Everything
● Metrics, metrics, metrics
○ Start from data inputs: 90% of your bugs are somewhere in data ingestion
○ A Spark or TensorFlow job is not a black box! (collect metrics; see the sketch after this list)
● Build Comparison Methodology
○ Grafana dashboards
○ Compare runs 1-to-1!
● Develop
○ Data pipelines are a huge Engineering investment
○ Don’t be afraid of having multiple systems
○ Implement your own! (know how and when to Fork)
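A sketch of what "not a black box" can look like in PySpark; the input path, column names, and metric names are illustrative assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ingest_orders").getOrCreate()
    df = spark.read.parquet("s3://bucket/raw/orders/")    # hypothetical input

    # Input metrics first: most bugs hide in ingestion.
    row_count = df.count()
    null_amounts = df.filter(F.col("amount").isNull()).count()
    distinct_customers = df.select("customer_id").distinct().count()

    # In a real pipeline, ship these to StatsD/Grafana instead of printing.
    print({"rows_in": row_count,
           "null_amount_ratio": null_amounts / max(row_count, 1),
           "distinct_customers": distinct_customers})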
Airflow Operation vs Business Metrics
Connect Airflow to StatsD: great for cluster monitoring! (see the sketch below)
Use Airflow Trends!
We're not yet sure about inlets/outlets in BaseOperator
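In Airflow 1.10 the statsd_on, statsd_host, statsd_port, and statsd_prefix settings in the [scheduler] section of airflow.cfg cover the operational side (scheduler heartbeats, task durations, and so on); business metrics you emit yourself. A sketch with the statsd Python client, where the host, prefix, and metric names are assumptions:

    from statsd import StatsClient    # pip install statsd

    client = StatsClient(host="localhost", port=8125, prefix="data_pipelines")
    client.gauge("orders_ingest.rows_in", 1250000)      # business metric, not an Airflow one
    client.incr("orders_ingest.runs")
    client.timing("orders_ingest.duration_ms", 93000)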
Your KEY and x-axes
● Treat your system as a BATCH system, not as 24/7
○ Restarts will happen
○ Scheduling and SLA
● Scheduled time, execution time, restart time (see the sketch below)
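A sketch of keying a business metric to the run's execution_date rather than wall-clock time, so a rerun overwrites the same point on the x-axis; a Graphite backend and the metric name are assumptions:

    import socket

    def send_graphite(metric, value, timestamp, host="localhost", port=2003):
        # Graphite plaintext protocol: "<metric> <value> <timestamp>\n"
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall("{} {} {}\n".format(metric, value, timestamp).encode())

    def report_rows(rows, execution_date):
        # execution_date is the run's logical date from Airflow, not the restart time.
        send_graphite("pipelines.orders.rows_in", rows, int(execution_date.timestamp()))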
Your KEY name
Is it just a NAME?
Or more like ENV.PROJECT.PIPELINE.CLIENT.TASK_ID?
● Metrics 2.0 to the rescue! (use labels; see the sketch below)
● You’ll have similar tasks in the pipeline!
● You’ll run the same tasks in development!
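A sketch of the labels idea with prometheus_client; the metric name and label values are illustrative assumptions:

    from prometheus_client import Gauge

    # One metric name plus labels, instead of ENV.PROJECT.PIPELINE.CLIENT.TASK_ID
    # baked into the key. Similar tasks and dev runs just get different label values.
    ROWS_IN = Gauge(
        "pipeline_rows_in",
        "Rows read by a pipeline task",
        ["env", "project", "pipeline", "client", "task_id"],
    )

    ROWS_IN.labels(env="prod", project="marketing", pipeline="daily_orders",
                   client="acme", task_id="ingest_orders").set(1250000)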
Alerting
● Data processes are no different from your customer-facing frontend
○ You MUST monitor and alert!
● PagerDuty! Slack channel!
● (read up on good alerting practices)
● …
● Your jobs emit discrete, per-run metrics; your alerting system will not like that!
● Ask your team to develop stable KPIs (see the sketch below).
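One way to get a stable KPI out of discrete, per-run metrics is to alert on deviation from a rolling baseline instead of on raw values. A minimal sketch; the tolerance and the numbers are illustrative:

    import statistics

    def check_kpi(recent_runs, today, tolerance=0.3):
        """Alert when today's value drifts too far from the median of recent runs."""
        baseline = statistics.median(recent_runs)
        deviation = abs(today - baseline) / max(baseline, 1)
        if deviation > tolerance:
            # Here you would page (PagerDuty) or post to the Slack channel.
            raise ValueError("KPI off baseline: today={}, baseline={:.0f}, deviation={:.0%}".format(
                today, baseline, deviation))

    # Row counts from the last 7 successful runs (made-up numbers).
    check_kpi([1210540, 1195002, 1250331, 1188423, 1240775, 1201998, 1233104],
              today=412908)    # raises: this run needs attention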
What we didn’t discuss
Cost Monitoring
Advanced Alerting on Data Pipelines
How to reuse Production observability in Development and vice versa
What we did discuss
OBSERVABILITY!
Invest now!
Start from basics: instrument your code!
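A minimal place to start instrumenting, as a sketch: a decorator that records duration and status for every pipeline step (swap the log call for your StatsD/Prometheus client later; names are illustrative):

    import functools
    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline.metrics")

    def instrumented(step_name):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.monotonic()
                try:
                    result = func(*args, **kwargs)
                    log.info("%s succeeded in %.1fs", step_name, time.monotonic() - start)
                    return result
                except Exception:
                    log.error("%s failed after %.1fs", step_name, time.monotonic() - start)
                    raise
            return wrapper
        return decorator

    @instrumented("ingest_orders")
    def ingest_orders():
        time.sleep(0.1)    # stand-in for real work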
References
https://airflow.apache.org/
https://medium.com/databand-ai/observability-for-data-engineering-a2e826587205
Contact us to learn more and see
the product in action!
www.databand.ai
contact@databand.ai
