
Data Pipeline Observability meetup

Data Pipeline Observability

Published in: Data & Analytics
  1. Data Pipeline Observability, by Evgeny Shulman, CTO at databand.ai
  2. About Me ● Building data pipelines since 2004 ● Founding employee at Crosswise (acquired by Oracle Data Cloud) ○ Luigi, Airflow, all possible Spark deployments, taking ML pipelining to the extreme ● CTO at Databand.ai, father of two boys, enjoying both! :)
  3. THE DREAM!
  4. REAL DATA PIPELINES ● TOOLS ○ Spark, TensorFlow, Python, SQL… ● DATA ○ formats, schemas, versions ○ sources ● COMPLEXITY ○ pipelines of pipelines, wiring… Real data pipelines often fail, take forever to rerun, and give no idea of what's wrong!
  5. Because of... Code Changes ● DAGs change a lot too! ● CI/CD?! ● A side effect is somewhere downstream… → OBSERVABILITY NEEDED!
  6. Because of... Data Changes ● Data schema changes ● Data quality changes ● Data corruptions → OBSERVABILITY NEEDED!
  7. Because… Everything Changes! ● Python dependency resolution: Pandas 1.0 :) ● Clusters ● Internal libraries → OBSERVABILITY NEEDED!
  8. REALITY
  9. Airflow in ML/Data? (What was it built for?) What we love: ● Flexibility ● Scheduling ● Community ● Maturity
  10. Does it solve the PROBLEM? (Yes! And no…)
  11. What's the Solution?
  12. take 1 - Production Engineering Approach ● Production metrics: Grafana, Kibana ● Production logging: Loggly, Logz.io, others ● Production alerting: Datadog, Zabbix, Nagios, Grafana alerts
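To make the production-engineering approach concrete: pipeline tasks can emit custom metrics in the plain StatsD wire format over UDP, so they land in whatever backend (Telegraf, a Datadog agent, statsd_exporter) is already running. This is a minimal sketch; the host, port, and metric names are illustrative assumptions, not anything from the talk.

```python
# Minimal sketch: emit a pipeline metric as a raw StatsD UDP packet.
# Host/port (127.0.0.1:8125 is the conventional StatsD default) and the
# metric name are assumptions for illustration.
import socket

def emit_statsd(name: str, value: float, metric_type: str = "g",
                host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Send one metric in StatsD format; returns the raw payload sent."""
    payload = f"{name}:{value}|{metric_type}".encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload, (host, port))  # fire-and-forget UDP, never blocks the task
    sock.close()
    return payload

# e.g. report how many rows the ingestion task produced
packet = emit_statsd("pipeline.ingest.rows_written", 12345, "g")
```

Because the transport is UDP, the call is safe to leave in production tasks even when no collector is listening.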
  13. take 2 - Data Science Approach ● Experiment management by an external system (MLflow, Sacred, SageMaker, and many others) ● Encapsulating reporting into notebooks
  14. take 3 - Data Ops Engineering Approach ● In-house development ● Custom operators ● Data management on its own ● Job submission on its own ● Validation operators (+ external frameworks)
  15. take N - Mix and Match: Understand your customer
  16. Measure Everything ● Metrics, metrics, metrics ○ Start from data inputs: 90% of your bugs are somewhere in data ingestion ○ A Spark or TensorFlow job is not a black box! (collect metrics) ● Build a comparison methodology ○ Grafana dashboards ○ 1-to-1 comparisons! ● Develop ○ Data pipelines are a huge engineering investment ○ Don't be afraid of having multiple systems ○ Implement your own! (know how and when to fork)
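The "start from data inputs" and "1-to-1 compare" points above can be sketched as follows: profile each batch into a flat dict of scalar metrics, then diff two runs so a schema or quality change surfaces immediately. The field names and drift tolerance are illustrative assumptions.

```python
# Sketch: profile a batch of records into comparable metrics, then compare
# today's run 1-to-1 against a previous run. Tolerance of 10% is an assumption.
from typing import Any

def profile_batch(rows: list) -> dict:
    """Turn a batch of dict records into scalar metrics (row count, null ratios)."""
    metrics = {"row_count": float(len(rows))}
    if rows:
        for col in rows[0]:
            nulls = sum(1 for r in rows if r.get(col) is None)
            metrics[f"null_ratio.{col}"] = nulls / len(rows)
    return metrics

def compare_runs(prev: dict, curr: dict, tolerance: float = 0.1) -> list:
    """Report metrics that appeared, disappeared, or drifted beyond tolerance."""
    issues = []
    for key in sorted(prev.keys() | curr.keys()):
        if key not in curr:
            issues.append(f"missing metric: {key}")
        elif key not in prev:
            issues.append(f"new metric: {key}")
        elif abs(curr[key] - prev[key]) > tolerance * max(abs(prev[key]), 1.0):
            issues.append(f"drift in {key}: {prev[key]} -> {curr[key]}")
    return issues

yesterday = profile_batch([{"user": "a", "age": 30}, {"user": "b", "age": None}])
today = profile_batch([{"user": "c", "age": None}, {"user": "d", "age": None}])
print(compare_runs(yesterday, today))  # flags the jump in null_ratio.age
```

A schema change shows up as a missing or new `null_ratio.*` key, and a quality change as a drift, which covers the two failure classes the slides call out.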
  17. Airflow Operational vs. Business Metrics ● Connect Airflow to StatsD; great for cluster monitoring! ● Use Airflow trends! ● Not sure about inlets/outlets in BaseOperator
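As a sketch of the Airflow-to-StatsD hookup: in Airflow 2.x these settings live in the `[metrics]` section of `airflow.cfg` (1.x used `[scheduler]`), and the StatsD client extra must be installed, e.g. `pip install 'apache-airflow[statsd]'`. The host, port, and prefix below are illustrative defaults.

```ini
[metrics]
statsd_on = True
statsd_host = localhost
statsd_port = 8125
statsd_prefix = airflow
```

This gives you the scheduler and task-duration metrics the slide calls cluster monitoring; business metrics about the data itself still have to be emitted from your own task code.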
  18. Your KEY and x-axes ● Treat your system as a BATCH system, not as 24/7 ○ Restarts will happen ○ Scheduling and SLA ● Scheduled time vs. execution time vs. restart time
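One way to read the scheduled-vs-execution-time point: timestamp each metric with the run's *scheduled* time (Airflow's execution date), not wall-clock time, so a 3 a.m. failure rerun at 9 a.m. overwrites the same data point instead of plotting a misleading second one. A minimal sketch, with illustrative names:

```python
# Sketch: key a metric point on the batch's scheduled time so retries and
# restarts land on the same x-axis position. Names are assumptions.
from datetime import datetime, timezone
from typing import Optional

def metric_point(name: str, value: float, scheduled: datetime,
                 emitted: Optional[datetime] = None) -> dict:
    """A metric keyed on the batch's scheduled time, stable across restarts."""
    return {
        "name": name,
        "value": value,
        # x-axis: the interval the batch was *supposed* to cover
        "timestamp": scheduled.isoformat(),
        # kept separately for debugging restarts and SLA analysis
        "emitted_at": (emitted or datetime.now(timezone.utc)).isoformat(),
    }

point = metric_point("pipeline.rows", 1000.0,
                     scheduled=datetime(2020, 3, 1, tzinfo=timezone.utc))
print(point["timestamp"])  # 2020-03-01T00:00:00+00:00
```

Keeping the emission time as a second field preserves the restart/SLA information without polluting the primary x-axis.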
  19. Your KEY name Is it just a NAME? Or more like ENV.PROJECT.PIPELINE.CLIENT.TASK_ID? ● Metrics 2.0 to the rescue! (use labels) ● You'll have similar tasks in the pipeline! ● You'll run the same tasks in development!
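The "use labels" idea above, sketched under the assumption of a DogStatsD-style tagged backend: instead of packing environment, project, pipeline, and task into one long dotted key, keep a short metric name plus a label dict, so the same task in dev and prod, or N similar tasks, share one series family. All names here are illustrative.

```python
# Sketch of Metrics 2.0-style labels: short metric name plus key:value tags,
# rendered in the DogStatsD extension of the StatsD line format.
def labeled_metric(name: str, value: float, **labels: str) -> str:
    """Render a gauge with tags instead of a dotted hierarchical name."""
    tag_str = ",".join(f"{k}:{v}" for k, v in sorted(labels.items()))
    return f"{name}:{value}|g|#{tag_str}"

line = labeled_metric("rows_written", 500,
                      env="prod", pipeline="daily_ingest", task_id="load_users")
print(line)
```

Switching `env="prod"` to `env="dev"` reuses the exact same metric name, which is what makes dashboards and alerts portable between development and production runs.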
  20. Alerting ● Data processes are no different from your customer-facing frontend ○ You MUST monitor and alert! ● PagerDuty! Slack channel! ● (read good practices on alerting) ● … ● Your jobs produce discrete metrics; your alerting system will not like that! ● Ask your team to develop stable KPIs.
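One possible shape for a "stable KPI" check on discrete, once-per-run metrics: rather than a raw threshold (noisy for batch jobs), compare the current run against the median of recent runs. The window size and tolerance below are assumptions for illustration.

```python
# Sketch: alert when a once-per-run metric deviates from the recent median
# by more than a tolerance (50% here, an arbitrary illustrative choice).
from statistics import median

def should_alert(history: list, current: float, tolerance: float = 0.5) -> bool:
    """True when the current run deviates from the recent median baseline."""
    if len(history) < 3:             # not enough runs to form a baseline
        return False
    baseline = median(history[-7:])  # e.g. the last week of daily runs
    return abs(current - baseline) > tolerance * abs(baseline)

rows_per_run = [980.0, 1010.0, 995.0, 1005.0, 990.0]
print(should_alert(rows_per_run, 1002.0))  # normal run
print(should_alert(rows_per_run, 120.0))   # ingestion probably broke
```

A median baseline tolerates one bad historical run, which a mean-based check would not; that is the kind of stability the slide asks teams to build into their KPIs.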
  21. What we didn't discuss ● Cost monitoring ● Advanced alerting on data pipelines ● How to reuse production observability in development, and vice versa
  22. What we did discuss ● OBSERVABILITY! ● Invest now! ● Start from the basics: instrument your code!
  23. References https://airflow.apache.org/ https://medium.com/databand-ai/observability-for-data-engineering-a2e826587205
  24. Contact us to learn more and see the product in action! www.databand.ai contact@databand.ai
