Manish Singh
Engineer at Hevo
https://linkedin.com/in/manishsingh123/
Challenges in Building a
Data Pipeline
● Data Pipeline
● Possible Implementations
● Challenges
● Data Processing Architectures
Agenda
● Highly scalable
● Highly available
● Low latency
● Zero data loss
● Support for multiple data sources (e.g. MySQL, NoSQL,
Mixpanel, Analytics)
● Instrumentation, monitoring, and alerting
● Real-time vs Batch
Expectations
Stream
● Use cases: live dashboards (count, average), rate limiting, triggers (sketch after this slide)
● Processing: Apache Storm, Apache Spark, Apache Samza
● Store: Elasticsearch, Druid, Spark SQL, Kafka SQL
Batch
● Batch processing and pre-computation
● Immutable store: HDFS, Cassandra, event stream to S3
● Data warehouse: HBase, Hive, Redshift, Postgres
Stream vs Batch
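
As a rough illustration of the live-dashboard count use case above, here is a minimal tumbling-window counter in plain Python. It stands in for what a Storm/Spark/Samza job would compute before writing to a store such as Druid; the window size and event shape are assumptions.

    # Toy tumbling-window event counter (stand-in for a stream-processing job).
    import time
    from collections import Counter

    WINDOW_SECONDS = 60  # assumed window size

    def window_key(event_ts: float) -> int:
        """Bucket an event timestamp into its tumbling window."""
        return int(event_ts // WINDOW_SECONDS)

    def count_events(events):
        """events: iterable of (timestamp, event_type) tuples."""
        counts = Counter()
        for ts, event_type in events:
            counts[(window_key(ts), event_type)] += 1
        return counts

    if __name__ == "__main__":
        now = time.time()
        sample = [(now, "click"), (now + 1, "click"), (now + 2, "view")]
        for (window, event_type), n in sorted(count_events(sample).items()):
            print(window, event_type, n)  # a live dashboard would read these counts

In a real pipeline the same aggregation runs continuously over an unbounded stream and is checkpointed by the processing framework.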
● ETL (Extract -> Transform -> Load)
● ELT (Extract -> Load -> Transform); the ordering difference is sketched after this slide
ETL vs ELT
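
To make the ordering difference concrete, here is a hedged sketch in Python; extract_rows, transform, and load are hypothetical placeholders, not any particular product's API.

    # Illustrative only: the same three steps, applied in a different order.
    def extract_rows():
        # placeholder for pulling rows from a source such as MySQL
        return [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "3.25"}]

    def transform(rows):
        # placeholder transformation: cast amount from text to float
        return [{**row, "amount": float(row["amount"])} for row in rows]

    def load(rows, table):
        # placeholder for a warehouse load (Redshift, Postgres, ...)
        print(f"loading {len(rows)} rows into {table}")

    def run_etl():
        load(transform(extract_rows()), "orders")  # transform before loading

    def run_elt():
        load(extract_rows(), "orders_raw")  # load the raw data first...
        # ...then transform inside the warehouse, typically with SQL, e.g.
        # CREATE TABLE orders AS SELECT id, CAST(amount AS FLOAT) AS amount FROM orders_raw

    if __name__ == "__main__":
        run_etl()
        run_elt()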
● In ETL, complex transformation logic adds latency before data reaches the destination
● Modern hardware, especially cloud warehouses, is better equipped to run transformations
● Loading raw data first is efficient and reduces time to load
● Cost-effective in the cloud; fewer components required
Moving from traditional ETL to ELT
● Query the source DB and keep an offset (ID or updated timestamp); see the sketch after this slide
● Database change logs (e.g. MySQL binlogs, MongoDB oplogs)
Replication Modes
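
A hedged sketch of the first replication mode (query the source and keep an offset); sqlite3 stands in for the source database here, and the table and column names are assumptions.

    # Incremental pull: fetch only rows changed since the last stored offset.
    import sqlite3

    def pull_incremental(conn, last_updated_at):
        """Return changed rows and the new offset (max updated_at seen)."""
        cur = conn.execute(
            "SELECT id, name, updated_at FROM users "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_updated_at,),
        )
        rows = cur.fetchall()
        new_offset = rows[-1][2] if rows else last_updated_at
        return rows, new_offset

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (id INTEGER, name TEXT, updated_at TEXT)")
        conn.executemany(
            "INSERT INTO users VALUES (?, ?, ?)",
            [(1, "a", "2020-01-01"), (2, "b", "2020-01-02")],
        )
        rows, offset = pull_incremental(conn, "2020-01-01")
        print(rows, offset)  # only row 2 is returned; persist `offset` for the next run

Change-log based replication (binlogs, oplogs) avoids the main weakness of this mode: rows that are deleted, or updated without touching the offset column, are never picked up.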
● New fields can be added to a source at any point in time
● Character lengths of string columns in the source can increase
● Data type incompatibility between source and destination (widening sketch after this slide)
● Varying type casting
● Data loss during loads: power failure, server failure, code bugs, etc.
Challenges
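
One hedged way to cope with the data-type incompatibility above is to widen the destination column rather than fail the load; the promotion order below is an assumption for illustration, not how any specific warehouse or product resolves conflicts.

    # Pick a destination type wide enough to hold both the existing column type
    # and the incoming data's type (widen, never narrow).
    WIDENING_ORDER = ["BOOLEAN", "BIGINT", "DOUBLE", "VARCHAR"]  # assumed lattice

    def widen(dest_type: str, incoming_type: str) -> str:
        """Return the wider of two types; fall back to VARCHAR on conflict."""
        if dest_type == incoming_type:
            return dest_type
        try:
            return max(dest_type, incoming_type, key=WIDENING_ORDER.index)
        except ValueError:  # unknown type name -> safest common ground
            return "VARCHAR"

    if __name__ == "__main__":
        print(widen("BIGINT", "DOUBLE"))   # DOUBLE
        print(widen("DOUBLE", "VARCHAR"))  # VARCHAR
        print(widen("BIGINT", "BIGINT"))   # BIGINT

The same idea handles growing string lengths: alter the destination column to the larger length before loading the batch.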
● Schema detection cannot be done upfront
● Different documents in a single collection can have a different
set of fields
● Different documents in a single collection can have
incompatible field data types
● Nested objects and arrays with a dynamic structure (flattening sketch after this slide)
Additional Challenges with
NoSQL
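
A hedged sketch of one way to deal with the points above: sample incoming documents, flatten nested objects, and record every type observed per field so the destination schema can be derived (or widened) from the sample. The document shapes are made up for illustration.

    # Infer a rough schema from a sample of heterogeneous documents.
    from collections import defaultdict

    def flatten(doc, prefix=""):
        """Yield (dotted_path, type_name) pairs, recursing into nested objects."""
        for key, value in doc.items():
            path = f"{prefix}{key}"
            if isinstance(value, dict):
                yield from flatten(value, prefix=f"{path}.")
            else:
                yield path, type(value).__name__

    def infer_schema(documents):
        schema = defaultdict(set)
        for doc in documents:
            for path, type_name in flatten(doc):
                schema[path].add(type_name)
        return dict(schema)

    if __name__ == "__main__":
        docs = [
            {"_id": 1, "price": 10, "meta": {"tags": ["a"]}},
            {"_id": "2", "price": 10.5},  # _id and price types differ from the first doc
        ]
        print(infer_schema(docs))
        # e.g. {'_id': {'int', 'str'}, 'price': {'int', 'float'}, 'meta.tags': {'list'}}

Fields that end up with more than one observed type then need a widening rule like the one sketched for the previous slide.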
● Transformations
● Security (filtering, hashing) - masking sketch after this slide
● Replay Mechanism
● Integrity and Anomaly Detection
● Monitoring and Alerts for failures
● Activity Log
Effective Implementations
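
For the security item, a minimal sketch of masking sensitive columns before they are loaded; the field list and salt handling are illustrative assumptions, not a recommendation of a specific scheme.

    # Hash PII fields before loading so raw values never reach the destination.
    import hashlib

    SENSITIVE_FIELDS = {"email", "phone"}  # assumed list of fields to mask
    SALT = b"replace-with-a-secret-salt"   # in practice, read from a secret store

    def mask(record: dict) -> dict:
        masked = dict(record)
        for field in SENSITIVE_FIELDS:
            if field in masked:
                digest = hashlib.sha256(SALT + str(masked[field]).encode()).hexdigest()
                masked[field] = digest
        return masked

    if __name__ == "__main__":
        print(mask({"id": 7, "email": "user@example.com", "plan": "pro"}))

Because the hash is deterministic, the masked column can still be used for joins and uniqueness checks without exposing the original value.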
● "How to beat the CAP theorem" by Nathan Marz
● Different layers for stream and batch processing (serving-layer sketch below)
● Need to manage two different layers of the system
Lambda Architecture
Lambda Architecture (diagram)
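
A hedged sketch of what the two layers mean at query time: the serving layer merges a precomputed batch view with the speed layer's recent results. The data shapes are assumptions for illustration.

    # Serving-layer merge: the batch view covers data up to the last batch run,
    # the speed (stream) view covers everything since; queries combine both.
    def merge_counts(batch_view: dict, speed_view: dict) -> dict:
        merged = dict(batch_view)
        for key, count in speed_view.items():
            merged[key] = merged.get(key, 0) + count
        return merged

    if __name__ == "__main__":
        batch_view = {"page_a": 1000, "page_b": 250}  # recomputed by the batch layer
        speed_view = {"page_a": 12, "page_c": 3}      # counted since the last batch run
        print(merge_counts(batch_view, speed_view))
        # {'page_a': 1012, 'page_b': 250, 'page_c': 3}

Keeping the batch and stream code paths consistent is the operational cost the Kappa Architecture on the next slide tries to remove.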
● "Questioning the Lambda Architecture" by Jay Kreps
● A single stream-processing layer, scaled through parallelism
● Set the Kafka retention policy to keep enough history for replay
● Reprocess the retained log into a separate table
● Switch to the new table when reprocessing is done and delete the old one (reprocessing sketch below)
Kappa Architecture
Kappa Architecture (diagram)
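
A hedged sketch of the reprocessing steps listed above; the in-memory log and tables stand in for a retained Kafka topic and the serving store, and process_v2 is a hypothetical new version of the job.

    # Kappa-style reprocessing: replay the retained log through the new job
    # version into a fresh table, then switch readers over and drop the old table.
    def process_v1(event):  # current job logic
        return event["value"]

    def process_v2(event):  # new job logic we want to backfill with
        return event["value"] * 2

    def reprocess(log, process, tables, new_table):
        tables[new_table] = [process(e) for e in log]  # replay from offset 0

    if __name__ == "__main__":
        log = [{"value": 1}, {"value": 2}]  # stands in for a retained Kafka topic
        tables = {"results_v1": [process_v1(e) for e in log]}

        reprocess(log, process_v2, tables, "results_v2")  # build the new table
        serving_table = "results_v2"                      # switch readers to it
        del tables["results_v1"]                          # then delete the old table
        print(serving_table, tables[serving_table])

This only works if the Kafka retention policy keeps enough history to rebuild the table, which is why the retention setting appears as its own bullet above.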
Questions?
Thank You
Manish Singh, Hevo
https://linkedin.com/in/manishsingh123/

Editor's Notes

  • #5 https://youtu.be/YzAIjEQ75_c?t=6892 Explain Kafka SQL
  • #8 Yahoo's Hadoop clusters sorted 1 TB of data in 209 seconds; a petabyte sort using Spark took 4 hours
  • #16 Lambda - 11th Greek letter
  • #18 Kappa - 10th Greek letter