Data Pipeline Challenges and Architectures

The document discusses challenges in building a data pipeline including making it highly scalable, available with low latency and zero data loss while supporting multiple data sources. It covers expectations for real-time vs batch processing and explores stream and batch architectures using tools like Apache Storm, Spark and Kafka. Challenges of data replication, schema detection and transformations with NoSQL are also examined. Effective implementations should include monitoring, security and replay mechanisms. Finally, lambda and kappa architectures for combining stream and batch processing are presented.

Challenges in Building a Data Pipeline
Manish Singh
Engineer at Hevo
https://linkedin.com/in/manishsingh123/

Agenda
● Data Pipeline
● Possible Implementations
● Challenges
● Data Processing Architectures

Expectations
● Highly scalable
● Highly available
● Low latency
● Zero data loss
● Support for multiple data sources (e.g. MySQL, NoSQL, Mixpanel, Analytics)
● Instrumentation, monitoring, and alerting
● Real-time vs Batch

Stream vs Batch

Stream
● Uses: live dashboards (count, average), rate limiting, triggers
● Processing: Apache Storm, Apache Spark, Apache Samza
● Store: Elasticsearch, Druid, Spark SQL, Kafka SQL

Batch
● Batch processing and pre-computation
● Immutable store: HDFS, Cassandra, event stream to S3
● Data warehouse: HBase, Hive, Redshift, Postgres
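
The live-dashboard aggregates on the stream side (count, average) can be maintained one event at a time, while a batch job recomputes them from the stored data. The following is a minimal sketch of that contrast in plain Python; `RunningStats`, `batch_average`, and the sample values are illustrative names and data, not part of the talk or of any of the tools listed.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running count and mean, updated one event at a time (stream style)."""
    count: int = 0
    mean: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        # Incremental mean: past events do not need to be kept in memory.
        self.mean += (value - self.mean) / self.count


def batch_average(values):
    """Recompute the average from the full stored dataset (batch style)."""
    return sum(values) / len(values)


# Hypothetical event values, e.g. response times in milliseconds.
events = [120.0, 80.0, 100.0, 140.0]

stats = RunningStats()
for value in events:
    stats.update(value)
```

Both paths arrive at the same average here; the stream version has an answer after every event, whereas the batch version only has one after a full pass over the data.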

ETL vs ELT
● ETL (Extract -> Transform -> Load)
● ELT (Extract -> Load -> Transform)
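
The difference in ordering can be sketched with an in-memory SQLite database standing in for the destination. This is only an illustration under assumed names: the tables `orders_etl`, `orders_raw`, `orders_elt` and the sample rows are hypothetical, and a real warehouse would replace SQLite.

```python
import sqlite3

# Hypothetical extract from a source system: amounts arrive as strings.
raw_rows = [("alice", "12.50"), ("bob", "7.25")]


def etl(conn, rows):
    # ETL: transform in the pipeline, then load only the finished rows.
    cleaned = [(name.title(), float(amount)) for name, amount in rows]
    conn.execute("CREATE TABLE orders_etl (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)


def elt(conn, rows):
    # ELT: load the raw rows untouched, then transform inside the destination.
    conn.execute("CREATE TABLE orders_raw (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT upper(substr(customer, 1, 1)) || substr(customer, 2) AS customer, "
        "CAST(amount AS REAL) AS amount "
        "FROM orders_raw"
    )


conn = sqlite3.connect(":memory:")
etl(conn, raw_rows)
elt(conn, raw_rows)

etl_result = conn.execute(
    "SELECT customer, amount FROM orders_etl ORDER BY customer"
).fetchall()
elt_result = conn.execute(
    "SELECT customer, amount FROM orders_elt ORDER BY customer"
).fetchall()
```

In the ELT path the raw table stays in the destination, so the transformation can be rerun or changed later without extracting from the source again.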

Moving from traditional ETL to ELT
● In ETL, complex transformation logic compromises latency
● Hardware systems today are better equipped
● ELT is efficient and reduces load time
● Cost-effective in the cloud, with fewer components required

Replication Modes
● Query the source DB and keep an offset (ID, updated timestamp)
● Read database change logs (e.g. MySQL binlogs, MongoDB oplog)
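
The first mode, querying the source and keeping an offset, might look like the sketch below. It assumes a hypothetical `users` table with an integer `updated_at` column and uses SQLite as the source; the offset is the pair (updated timestamp, ID), so rows sharing a timestamp are not skipped between polls.

```python
import sqlite3


def fetch_changes(conn, last_offset, limit=100):
    """Return rows changed after last_offset, plus the new offset to persist.

    last_offset is an (updated_at, id) pair; the id breaks ties between rows
    that share the same timestamp.
    """
    last_ts, last_id = last_offset
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users "
        "WHERE updated_at > ? OR (updated_at = ? AND id > ?) "
        "ORDER BY updated_at, id LIMIT ?",
        (last_ts, last_ts, last_id, limit),
    ).fetchall()
    if rows:
        last_offset = (rows[-1][2], rows[-1][0])
    return rows, last_offset


# Hypothetical source table for the demonstration.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
source.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "ann", 100), (2, "ben", 100), (3, "cy", 105)],
)

offset = (0, 0)
first_batch, offset = fetch_changes(source, offset)

# A later update moves row 2 past the stored offset, so the next poll sees it.
source.execute("UPDATE users SET name = 'benjamin', updated_at = 110 WHERE id = 2")
second_batch, offset = fetch_changes(source, offset)
```

Polling like this only sees rows that still exist and whose timestamp is maintained on every write; hard deletes leave nothing to query, which is one reason to read the change log instead.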

Challenges
● New fields can be added to a source at any point in time
● Character lengths of string columns in the source can increase
● Data type incompatibility between source and destination
● Varying type casting
● Data loss during loads: power failure, server failure, code bugs, etc.
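
The first two challenges (new fields, growing string lengths) can be handled by comparing each incoming record with the destination schema before loading. The sketch below assumes a deliberately simplified model in which the destination schema is a dict of column name to current VARCHAR length; `plan_schema_changes` and the action names are hypothetical, not an API of any particular warehouse.

```python
def plan_schema_changes(dest_columns, record):
    """Compare one incoming record with the destination's column widths.

    dest_columns maps column name -> current VARCHAR length (simplified model).
    Returns the schema actions to apply before the record can be loaded.
    """
    actions = []
    for field, value in record.items():
        needed = len(str(value))
        if field not in dest_columns:
            # The source gained a field the destination has never seen.
            actions.append(("add_column", field, needed))
        elif needed > dest_columns[field]:
            # The value no longer fits the existing column width.
            actions.append(("widen_column", field, needed))
    return actions


dest = {"id": 10, "name": 5}
incoming = {"id": "42", "name": "Alexandria", "country": "IN"}
actions = plan_schema_changes(dest, incoming)
```

Applying the planned actions before the load, rather than failing the load and retrying, is one way to keep schema drift from turning into data loss.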

Additional Challenges with NoSQL
● Schema detection cannot be done upfront
● Different documents in a single collection can have different sets of fields
● Different documents in a single collection can have incompatible data types for the same field
● Nested objects and arrays with a dynamic structure
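
Because the schema cannot be known upfront, one option is to infer it from the documents as they arrive and flag fields whose types disagree. The following sketch uses plain dicts as stand-ins for documents; `flatten`, `infer_schema`, and the sample documents are illustrative assumptions.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted paths; lists are kept as single values."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def infer_schema(docs):
    """Map each field path to the set of Python type names seen for it."""
    seen = defaultdict(set)
    for doc in docs:
        for path, value in flatten(doc).items():
            seen[path].add(type(value).__name__)
    return dict(seen)


# Two documents from the same hypothetical collection.
docs = [
    {"_id": 1, "age": 30, "address": {"city": "Pune"}},
    {"_id": 2, "age": "thirty-one", "tags": ["a", "b"]},
]

schema = infer_schema(docs)
# Fields seen with more than one type need a casting decision at the destination.
conflicts = sorted(path for path, types in schema.items() if len(types) > 1)
```

Here `age` is reported as a conflict because it appears as both an integer and a string, while `address.city` and `tags` each exist in only one of the two documents.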

Effective Implementations
● Transformations
● Security (filtering, hashing)
● Replay mechanism
● Integrity and anomaly detection
● Monitoring and alerts for failures
● Activity log
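
The security item (filtering and hashing) can be expressed as a per-record transformation applied before data leaves the pipeline. The sketch below is one possible shape: the slide only says "hashing", so the use of a keyed hash (HMAC-SHA256) is an assumption, and `secure_record`, the field names, and the secret are hypothetical.

```python
import hashlib
import hmac


def secure_record(record, drop_fields, hash_fields, secret):
    """Drop sensitive fields entirely and replace others with a keyed hash."""
    out = {}
    for key, value in record.items():
        if key in drop_fields:
            continue  # filter: the field never reaches the destination
        if key in hash_fields:
            # Keyed hash: equal inputs still map to equal outputs, so joins work.
            digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
            out[key] = digest.hexdigest()
        else:
            out[key] = value
    return out


record = {"id": 1, "email": "a@example.com", "password": "hunter2"}
secured = secure_record(
    record,
    drop_fields={"password"},
    hash_fields={"email"},
    secret=b"demo-secret",  # placeholder; a real key would come from a secret store
)
```

Hashing with a key rather than a bare digest makes it harder to recover values such as email addresses by hashing guesses, while keeping the output deterministic for joins and deduplication.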

Lambda Architecture
● "How to beat the CAP theorem" by Nathan Marz
● Different layers for stream and batch processing
● Need to manage two different layers of the system
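
The two layers can be illustrated with a toy page-view counter: a batch view recomputed from the full dataset, a speed layer that counts events arriving since the last batch run, and a query that merges both. This is a minimal in-memory sketch; `batch_view`, `SpeedLayer`, and `query` are illustrative names and no real processing framework is involved.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute page-view counts from the full immutable dataset."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Speed layer: incrementally counts events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, event):
        self.counts[event["page"]] += 1


def query(page, batch_counts, speed):
    # Serving step: merge the precomputed batch view with the realtime view.
    return batch_counts[page] + speed.counts[page]


master = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
batch_counts = batch_view(master)

speed = SpeedLayer()
speed.ingest({"page": "/home"})  # arrived after the batch run

home_views = query("/home", batch_counts, speed)
```

Note that the counting logic exists twice, once in `batch_view` and once in `SpeedLayer.ingest`; keeping those two implementations in agreement is the maintenance cost the last bullet refers to.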

Kappa Architecture
● "Questioning the Lambda Architecture" by Jay Kreps
● Only stream processing, with parallelism
● Set the Kafka retention policy
● Reprocess into a separate table
● Switch tables when done and delete the old one
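
The reprocessing steps above (retain the log, rebuild into a separate table, switch, delete the old one) can be sketched with an in-memory list standing in for the retained log. The `Log` class is a hypothetical stand-in and is not the Kafka client API; the table names and transforms are likewise illustrative.

```python
class Log:
    """Stand-in for a retained, replayable log such as a Kafka topic."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset=0):
        return iter(self.events[offset:])


def build_table(log, transform):
    """Replay the whole log through one version of the stream logic."""
    table = {}
    for event in log.read_from(0):
        key, value = transform(event)
        table[key] = table.get(key, 0) + value
    return table


log = Log()
for user, cents in [("ann", 500), ("ben", 250), ("ann", 125)]:
    log.append({"user": user, "cents": cents})

# Version 1 of the logic feeds the table that readers currently use.
tables = {"live": build_table(log, lambda e: (e["user"], e["cents"]))}

# Version 2 changes the logic; reprocess the retained log into a separate table.
tables["candidate"] = build_table(log, lambda e: (e["user"].upper(), e["cents"]))

# Switch readers to the new table; the old one is dropped by the reassignment.
tables["live"] = tables.pop("candidate")
```

Replay from offset 0 is only possible while the log still holds the events, which is why the retention policy has to be set long enough to cover any reprocessing the team expects to do.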

Thank You
Manish Singh, Hevo
https://linkedin.com/in/manishsingh123/

Editor's Notes

https://youtu.be/YzAIjEQ75_c?t=6892 Explain Kafka SQL
Yahoo’s Hadoop clusters sorted 1 TB of data in 209 seconds
Petabyte sort using Spark in 4 hours
Lambda - 11th Greek letter
Kappa - 10th Greek letter
