Observability in data pipelines

What is a data pipeline?

A data pipeline is responsible for Extraction, Transformation and Load. A pipeline:

a) Usually has more than one source of data for extraction. E.g., a pipeline may read rows from a CSV file and, for each row, fetch a corresponding attachment file from another source.

b) Usually has many layers of transformation/processing distributed across different systems. E.g., a machine-learning pipeline could cleanse the data in a Python process and train the model in a Spark process.

c) Usually has more than one downstream storage system to load data into. E.g., a combination of a relational DB and Elasticsearch is very common.

Click here to learn more about the Data Pipeline: https://www.imaginea.com/building-data-pipeline-101/

You could also write to us at connect@imaginea.com.




Published in: Technology
Observability in data pipelines

  1. Observability in Data Pipelines #ProductThinking. Private and confidential. Copyright (C) 2019, Imaginea Technologies Inc. All rights reserved.
  2. What is observability?
     ● Bringing better visibility into a (distributed) system
     ● Detecting partial or full failures
     ● Ability to test in the production environment
     ● Is directly proportional to the ability to debug
  3. Embrace failure at all stages: Design, Development, Testing, Deployment, Operations
  4. Why do we need observability?
     ● Distributed systems are inherently prone to failure
     ● A perfectly functioning complex distributed system is not achievable; one can only be prepared to act based on symptoms
  5. How do we achieve observability - the key aspects
  6. The components of a distributed system - simplified
  7. What is a data pipeline? Responsible for Extraction, Transformation and Load
     ● Usually has more than one source of data for extraction. E.g., a pipeline may read rows from a CSV file and, for each row, fetch a corresponding attachment file from another source.
     ● Usually has many layers of transformation/processing distributed across different systems. E.g., a machine-learning pipeline could cleanse the data in a Python process and train the model in a Spark process.
     ● Usually has more than one downstream storage system to load data into. E.g., a combination of a relational DB and Elasticsearch is very common.
  8. The need for an orchestrator. Some considerations:
     ● Ability to compose/design a pipeline on one platform using a standard language
     ● Ability to integrate with other processes in the architecture
     ● Ability to create, modify or drop pipelines effortlessly in production
     ● Ability to scale out
     Slide labels: Generic, Platform Specific
  9. Data pipelines are tricky to test
     ● Testing the system as a whole can only happen in pre-production, or sometimes only in production
     ● While full failures are relatively easy to detect, partial failures are usually detected at a later stage in a downstream system
     ● It is not only about validating the data; it is also about detecting minor deviations in the data set caused by a faulty upstream process
  10. Design considerations for building a testable pipeline
     ● Build a dry-run feature that can run the pipeline against test data sets without persisting data in the downstream system(s)
     ● Make sure an event or a record is immutable
     ● Build a circuit-breaker-type feature that prevents further execution of the pipeline based on a metric
     ● Identify key system and business metrics to monitor
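The first three considerations on the slide above can be sketched together in a few lines of Python. This is a minimal illustration, not the deck's implementation: `Record`, `run_pipeline`, `upper`, the in-memory `sink` list and the `max_error_rate` / `min_seen` thresholds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)  # frozen=True makes each record immutable
class Record:
    id: int
    value: str


class CircuitOpen(Exception):
    """Raised when the error-rate metric crosses the threshold."""


def run_pipeline(
    records: Iterable[Record],
    transform: Callable[[Record], Record],
    sink: List[Record],
    dry_run: bool = False,
    max_error_rate: float = 0.5,  # assumed threshold
    min_seen: int = 4,            # assumed warm-up before the breaker can trip
) -> dict:
    seen = failed = 0
    for record in records:
        seen += 1
        try:
            out = transform(record)
        except ValueError:
            failed += 1
            # Circuit breaker: stop once the error-rate metric is too high.
            if seen >= min_seen and failed / seen > max_error_rate:
                raise CircuitOpen(f"{failed}/{seen} records failed")
            continue
        if not dry_run:
            sink.append(out)  # persist only on a real run
    return {"seen": seen, "failed": failed}


def upper(record: Record) -> Record:
    """Toy transformation that rejects empty values."""
    if not record.value:
        raise ValueError("empty value")
    return Record(record.id, record.value.upper())


sink: List[Record] = []
test_data = [Record(1, "a"), Record(2, ""), Record(3, "c")]
dry_stats = run_pipeline(test_data, upper, sink, dry_run=True)
```

After the dry run, `sink` is still empty while `dry_stats` reports what would have happened. In a real pipeline the sink would be a database or index writer, and the breaker would more likely read its metric from the monitoring system than from local counters.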
  11. In data pipelines, monitoring is key. The three pillars of monitoring:
     ● Metrics - detect symptoms based on deviation in metrics, raise an alert
     ● Tracing - based on the symptom, identify the specific component of the pipeline that caused the issue
     ● Logging - get the most granular level of detail on the affected component
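As a rough illustration of "detect symptoms based on deviation in metrics", the sketch below flags a metric value that is far from its recent history using a z-score. The function name `check_deviation`, the threshold of 3 standard deviations and the sample row counts are assumptions made for the example; a production setup would normally express such a rule in the alerting layer of the metrics system.

```python
from statistics import mean, stdev
from typing import List, Optional


def check_deviation(
    history: List[float], latest: float, threshold: float = 3.0
) -> Optional[str]:
    """Return an alert message if `latest` deviates from `history`, else None."""
    if len(history) < 2:
        return None  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return None if latest == mu else f"value {latest} differs from constant {mu}"
    z = (latest - mu) / sigma
    if abs(z) > threshold:
        return f"value {latest} is {z:.1f} standard deviations from mean {mu:.1f}"
    return None


# Hypothetical metric: rows loaded per pipeline run.
rows_per_run = [1000, 1020, 980, 1010, 990]
alert = check_deviation(rows_per_run, 400)     # sharp drop, should alert
no_alert = check_deviation(rows_per_run, 1005)  # within normal range
```

The alert only says that something is wrong; tracing and logs are then used to find which component caused it.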
  12. How do we achieve observability - the key aspects
     Application container, with:
     ● Tracing client (Jaeger)
     ● Metrics client (Prometheus client)
     ● Logging sidecar (Fluentd)
  13. Possible issues in a pipeline
     ● Back pressure in stream consumers
     ● Failing sub-task of a pipeline (partial failure)
     ● Spike in pending tasks or processing time
     ● Job not scheduling due to missed configuration
     ● Source/upstream system responding too slowly
     ● Change in format of data (modified column positions, JSON keys, file structure)
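The last issue on the slide above, a change in data format, is one of the easier ones to check for up front. The sketch below compares a CSV header with an expected column list and reports missing, unexpected or reordered columns. `EXPECTED_COLUMNS` and `header_problems` are hypothetical names, and the schema is invented for the example.

```python
import csv
import io
from typing import List

EXPECTED_COLUMNS = ["id", "name", "amount"]  # hypothetical schema


def header_problems(csv_text: str, expected: List[str]) -> List[str]:
    """Compare the first CSV row with the expected columns."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    problems = []
    missing = [c for c in expected if c not in header]
    extra = [c for c in header if c not in expected]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if not missing and not extra and header != expected:
        # Same columns, different positions.
        problems.append(f"column order changed: {header}")
    return problems


ok = header_problems("id,name,amount\n1,a,2\n", EXPECTED_COLUMNS)
reordered = header_problems("name,id,amount\na,1,2\n", EXPECTED_COLUMNS)
dropped = header_problems("id,name\n1,a\n", EXPECTED_COLUMNS)
```

The number of files that fail such a check is itself a useful metric to alert on.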
  14. Metrics vs Logs for detecting symptoms
     Metrics | Logs
     Stored in memory at the application level | Either streamed to a log file or stdout
     Well-defined data structure, depending upon the TSDB | Unstructured/application-specific structure
     Low I/O overhead even under high traffic | I/O overhead is directly proportional to traffic
     TSDBs allow for rich and powerful queries on metrics | Querying capability is limited by the indexing capability of the storage system, e.g. Elasticsearch
     A higher number of metrics is encouraged in production systems | Fewer log statements are encouraged in production systems, limited to error scenarios
     Naturally more suitable for triggering alerts | More suitable for digging deeper after an error has occurred
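The contrast in the comparison above can be shown in a few lines: counters are updated in memory for every row, while a log line is written only when something goes wrong. This is a sketch using only the standard library; the `metrics` counter stands in for a real metrics client, and the `process` function and row layout are assumptions for the example.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.ERROR)  # production: log errors only
log = logging.getLogger("pipeline")

metrics = Counter()  # in-memory counters, cheap to update per row


def process(row: dict) -> None:
    metrics["rows_total"] += 1
    try:
        float(row["amount"])
    except (KeyError, ValueError):
        metrics["rows_failed"] += 1
        # The log carries the detail needed to debug this specific row.
        log.error("bad row: %r", row)


for row in [{"amount": "1.5"}, {"amount": "oops"}, {}]:
    process(row)
```

An alert would be driven by the ratio of `rows_failed` to `rows_total`, and the error log would then be consulted to see which rows failed and why.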
  15. Tracing in data pipelines
     ● In this context, tracing loosely translates to lineage of data
     ● Out-of-the-box frameworks exist for network/request tracing
     ● Framework- and use-case-specific custom data lineage has to be developed
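One simple way to picture custom data lineage is to let every value carry its source and the list of steps applied to it. The sketch below is only an illustration of that idea; the `Traced` wrapper, `apply_step` helper and the `orders.csv:row=7` source label are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Traced:
    payload: Any
    source: str                    # where the value originally came from
    lineage: Tuple[str, ...] = ()  # names of the steps applied so far


def apply_step(name: str, fn: Callable[[Any], Any], item: Traced) -> Traced:
    """Apply fn to the payload and record the step name in the lineage."""
    return Traced(fn(item.payload), item.source, item.lineage + (name,))


raw = Traced(" 42 ", source="orders.csv:row=7")
clean = apply_step("strip", str.strip, raw)
parsed = apply_step("to_int", int, clean)
```

When a bad value shows up downstream, `parsed.source` and `parsed.lineage` identify the input row and the steps it passed through, which narrows the search to a specific component.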
  16. Credits
     Disclaimer: This document may contain forward-looking statements concerning products and strategies. These statements are based on management's current expectations and actual results may differ materially from those projected, as a result of certain risks, uncertainties and assumptions, including but not limited to: the growth of the markets addressed by our products and our customers' products, the demand for and market acceptance of our products; our ability to successfully compete in the markets in which we do business; our ability to successfully address the cost structure of our offerings; the ability to develop and implement new technologies and to obtain protection for the related intellectual property; and our ability to realize financial and strategic benefits of past and future transactions. These forward-looking statements are made only as of the date indicated, and the company disclaims any obligation to update or revise the information contained in any forward-looking statements, whether as a result of new information, future events or otherwise. All Trademarks and other registered marks belong to their respective owners. Copyright © 2019, Imaginea Technologies, Inc. and/or its affiliates. All rights reserved. Images under Creative Commons Zero license.
