Data Pipeline Observability
By Evgeny Shulman
CTO at databand.ai
About Me
● Building Data Pipelines since 2004
● Founding employee at Crosswise (acquired by Oracle Data Cloud)
○ Luigi, Airflow, all possible Spark deployments, taking ML pipelining to the extreme.
● CTO at Databand.ai, father of two boys, enjoying both! :)
THE DREAM!
REAL DATA PIPELINES
● TOOLS
○ Spark, TensorFlow, Python, SQL…
● DATA
○ formats, schemas, versions
○ sources
● COMPLEXITY
○ pipelines of pipelines, wiring..
Real data pipelines often fail, take forever to rerun, and give you no idea of what went wrong!
Because of... Code Changes
● DAGs change a lot too!
● CI/CD?!
● A side effect lurks somewhere downstream…
→ OBSERVABILITY NEEDED!
Because of... Data Changes
● Data Schema changes
● Data Quality Changes
● Data Corruptions
→ OBSERVABILITY NEEDED!
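For example, a minimal ingestion check (a sketch only; pandas assumed, and the expected columns are hypothetical) can surface schema and quality changes before they propagate downstream:

    import pandas as pd

    # Hypothetical expected schema; adjust to your own dataset.
    EXPECTED_SCHEMA = {"user_id": "int64", "event_time": "datetime64[ns]", "amount": "float64"}

    def check_ingested_data(df: pd.DataFrame) -> None:
        """Fail fast on schema drift and log basic quality signals."""
        actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
        missing = set(EXPECTED_SCHEMA) - set(actual)
        drifted = {c: (EXPECTED_SCHEMA[c], actual[c])
                   for c in EXPECTED_SCHEMA if c in actual and actual[c] != EXPECTED_SCHEMA[c]}
        if missing or drifted:
            raise ValueError("Schema changed: missing={}, drifted={}".format(missing, drifted))
        # Simple quality signals: row count and worst per-column null ratio.
        print("rows={}, max_null_ratio={:.2%}".format(len(df), df.isna().mean().max()))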
Because… Everything Changes!
● Python dependency resolution: Pandas 1.0 :)
● Clusters
● Internal Libraries
→ OBSERVABILITY NEEDED!
REALITY
What we love
● Flexibility
● Scheduling
● Community
● Maturity
Airflow in ML/Data? (What was it built for?)
Does it solve the PROBLEM?
(Yes! And No..)
What’s the Solution?
take 1 - Production Engineering Approach
● Production metrics: Grafana, Kibana
● Production logging: Loggly, Logz.io, others
● Production alerting: Datadog, Zabbix, Nagios, Grafana alerts
take 2 - Data Science Approach
● Experimentation management by an external system (MLflow, Sacred, SageMaker, and many, many others)
● Encapsulating reporting into notebooks
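A minimal MLflow tracking sketch of this take; the tracking URI, experiment name, and metric names below are illustrative assumptions, not prescriptions:

    import mlflow

    mlflow.set_tracking_uri("http://localhost:5000")    # assumed local tracking server
    mlflow.set_experiment("daily_feature_pipeline")      # hypothetical experiment name

    with mlflow.start_run(run_name="2020-02-01"):
        mlflow.log_param("input_path", "s3://bucket/raw/2020-02-01/")
        mlflow.log_metric("rows_in", 1250000)
        mlflow.log_metric("rows_out", 1187345)
        mlflow.log_metric("null_ratio", 0.013)

The same pattern applies with Sacred or SageMaker; the point is that run metadata lands in a system built for comparing runs.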
take 3 - Data Ops Engineering Approach
● In-house development
● Custom operators
● Data management on its own
● Job submission on its own
● Validation operators (+ external frameworks)
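A sketch of the custom/validation operator idea, using Airflow 1.10-style imports; the table name, connection id, and threshold are assumptions for illustration, not Databand's actual code:

    from airflow.models import BaseOperator
    from airflow.hooks.postgres_hook import PostgresHook

    class RowCountValidationOperator(BaseOperator):
        """Fail the task when the target table looks suspiciously small."""

        def __init__(self, table, min_rows=1, postgres_conn_id="my_dwh", *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.table = table
            self.min_rows = min_rows
            self.postgres_conn_id = postgres_conn_id

        def execute(self, context):
            hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
            rows = hook.get_first("SELECT COUNT(*) FROM {}".format(self.table))[0]
            if rows < self.min_rows:
                raise ValueError("{}: {} rows, expected at least {}".format(
                    self.table, rows, self.min_rows))
            self.log.info("%s passed validation with %s rows", self.table, rows)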
take N: Mix and Match
Understand your customer
Measure Everything
● Metrics, metrics, metrics
○ Start from data inputs: 90% of your bugs are somewhere in data ingestion
○ A Spark or TensorFlow job is not a black box! (collect metrics; see the sketch after this list)
● Build Comparison Methodology
○ Grafana dashboards
○ Compare runs 1-to-1!
● Develop
○ Data pipelines are a huge Engineering investment
○ Don’t be afraid of having multiple systems
○ Implement your own! (know how and when to Fork)
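A sketch of what "not a black box" can look like in PySpark; the input path, column names, and metric names are illustrative assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("ingest_orders").getOrCreate()
    df = spark.read.parquet("s3://bucket/raw/orders/")    # hypothetical input

    # Input metrics first: most bugs hide in ingestion.
    row_count = df.count()
    null_amounts = df.filter(F.col("amount").isNull()).count()
    distinct_customers = df.select("customer_id").distinct().count()

    # In a real pipeline, ship these to StatsD/Grafana instead of printing.
    print({"rows_in": row_count,
           "null_amount_ratio": null_amounts / max(row_count, 1),
           "distinct_customers": distinct_customers})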
Airflow Operation vs Business Metrics
Connect Airflow to StatsD: great for cluster monitoring! (see the sketch below)
Use Airflow Trends!
We're not yet sure about inlets/outlets in BaseOperator
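In Airflow 1.10 the statsd_on, statsd_host, statsd_port, and statsd_prefix settings in the [scheduler] section of airflow.cfg cover the operational side (scheduler heartbeats, task durations, and so on); business metrics you emit yourself. A sketch with the statsd Python client, where the host, prefix, and metric names are assumptions:

    from statsd import StatsClient    # pip install statsd

    client = StatsClient(host="localhost", port=8125, prefix="data_pipelines")
    client.gauge("orders_ingest.rows_in", 1250000)      # business metric, not an Airflow one
    client.incr("orders_ingest.runs")
    client.timing("orders_ingest.duration_ms", 93000)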
Your KEY and x-axes
● Treat your system as a BATCH system, not as 24/7
○ Restarts will happen
○ Scheduling and SLA
● Scheduled time, execution time, restart time (see the sketch below)
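A sketch of keying a business metric to the run's execution_date rather than wall-clock time, so a rerun overwrites the same point on the x-axis; a Graphite backend and the metric name are assumptions:

    import socket

    def send_graphite(metric, value, timestamp, host="localhost", port=2003):
        # Graphite plaintext protocol: "<metric> <value> <timestamp>\n"
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall("{} {} {}\n".format(metric, value, timestamp).encode())

    def report_rows(rows, execution_date):
        # execution_date is the run's logical date from Airflow, not the restart time.
        send_graphite("pipelines.orders.rows_in", rows, int(execution_date.timestamp()))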
Your KEY name
Is it just a NAME?
Or more like ENV.PROJECT.PIPELINE.CLIENT.TASK_ID?
● Metrics 2.0 to the rescue! (use labels; see the sketch below)
● You’ll have similar tasks in the pipeline!
● You’ll run the same tasks in development!
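A sketch of the labels idea with prometheus_client; the metric name and label values are illustrative assumptions:

    from prometheus_client import Gauge

    # One metric name plus labels, instead of ENV.PROJECT.PIPELINE.CLIENT.TASK_ID
    # baked into the key. Similar tasks and dev runs just get different label values.
    ROWS_IN = Gauge(
        "pipeline_rows_in",
        "Rows read by a pipeline task",
        ["env", "project", "pipeline", "client", "task_id"],
    )

    ROWS_IN.labels(env="prod", project="marketing", pipeline="daily_orders",
                   client="acme", task_id="ingest_orders").set(1250000)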
Alerting
● Data processes are no different from your customer-facing frontend
○ You MUST monitor and alert!
● PagerDuty! Slack channel!
● (read up on good alerting practices)
● …
● Your jobs emit discrete, per-run metrics; your alerting system will not like that!
● Ask your team to develop stable KPIs (see the sketch below).
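One way to get a stable KPI out of discrete, per-run metrics is to alert on deviation from a rolling baseline instead of on raw values. A minimal sketch; the tolerance and the numbers are illustrative:

    import statistics

    def check_kpi(recent_runs, today, tolerance=0.3):
        """Alert when today's value drifts too far from the median of recent runs."""
        baseline = statistics.median(recent_runs)
        deviation = abs(today - baseline) / max(baseline, 1)
        if deviation > tolerance:
            # Here you would page (PagerDuty) or post to the Slack channel.
            raise ValueError("KPI off baseline: today={}, baseline={:.0f}, deviation={:.0%}".format(
                today, baseline, deviation))

    # Row counts from the last 7 successful runs (made-up numbers).
    check_kpi([1210540, 1195002, 1250331, 1188423, 1240775, 1201998, 1233104],
              today=412908)    # raises: this run needs attention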
What we didn’t discuss
Cost Monitoring
Advanced Alerting on Data Pipelines
How to reuse Production observability in Development and vice versa
What we did discuss
OBSERVABILITY!
Invest now!
Start from basics: instrument your code!
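A minimal place to start instrumenting, as a sketch: a decorator that records duration and status for every pipeline step (swap the log call for your StatsD/Prometheus client later; names are illustrative):

    import functools
    import logging
    import time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("pipeline.metrics")

    def instrumented(step_name):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                start = time.monotonic()
                try:
                    result = func(*args, **kwargs)
                    log.info("%s succeeded in %.1fs", step_name, time.monotonic() - start)
                    return result
                except Exception:
                    log.error("%s failed after %.1fs", step_name, time.monotonic() - start)
                    raise
            return wrapper
        return decorator

    @instrumented("ingest_orders")
    def ingest_orders():
        time.sleep(0.1)    # stand-in for real work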
References
https://airflow.apache.org/
https://medium.com/databand-ai/observability-for-data-engineering-a2e826587205
Contact us to learn more and see
the product in action!
www.databand.ai
contact@databand.ai
