Manish Singh
Engineer at Hevo
https://linkedin.com/in/manishsingh123/
Challenges in Building a
Data Pipeline
● Data Pipeline
● Possible Implementations
● Challenges
● Data Processing Architectures
Agenda
● Highly scalable
● Highly available
● Low latency
● Zero data loss
● Support for multiple data sources (e.g. MySQL, NoSQL,
Mixpanel, Analytics)
● Instrumentation, monitoring, and alerting
● Real-time vs Batch
Expectations
Stream
● Use cases: live dashboards (count, average), rate limiting, triggers (sketch after this slide)
● Processing: Apache Storm, Apache Spark, Apache Samza
● Store: Elasticsearch, Druid, Spark SQL, Kafka SQL
Batch
● Batch processing and pre-computation
● Immutable store: HDFS, Cassandra, event stream to S3
● Data warehouse: HBase, Hive, Redshift, Postgres
Stream vs Batch
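
As a rough illustration of the live-dashboard count use case above, here is a minimal tumbling-window counter in plain Python. It stands in for what a Storm/Spark/Samza job would compute before writing to a store such as Druid; the window size and event shape are assumptions.

    # Toy tumbling-window event counter (stand-in for a stream-processing job).
    import time
    from collections import Counter

    WINDOW_SECONDS = 60  # assumed window size

    def window_key(event_ts: float) -> int:
        """Bucket an event timestamp into its tumbling window."""
        return int(event_ts // WINDOW_SECONDS)

    def count_events(events):
        """events: iterable of (timestamp, event_type) tuples."""
        counts = Counter()
        for ts, event_type in events:
            counts[(window_key(ts), event_type)] += 1
        return counts

    if __name__ == "__main__":
        now = time.time()
        sample = [(now, "click"), (now + 1, "click"), (now + 2, "view")]
        for (window, event_type), n in sorted(count_events(sample).items()):
            print(window, event_type, n)  # a live dashboard would read these counts

In a real pipeline the same aggregation runs continuously over an unbounded stream and is checkpointed by the processing framework.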
● ETL (Extract -> Transform -> Load)
● ELT (Extract -> Load -> Transform); the ordering difference is sketched after this slide
ETL vs ELT
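
To make the ordering difference concrete, here is a hedged sketch in Python; extract_rows, transform, and load are hypothetical placeholders, not any particular product's API.

    # Illustrative only: the same three steps, applied in a different order.
    def extract_rows():
        # placeholder for pulling rows from a source such as MySQL
        return [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "3.25"}]

    def transform(rows):
        # placeholder transformation: cast amount from text to float
        return [{**row, "amount": float(row["amount"])} for row in rows]

    def load(rows, table):
        # placeholder for a warehouse load (Redshift, Postgres, ...)
        print(f"loading {len(rows)} rows into {table}")

    def run_etl():
        load(transform(extract_rows()), "orders")  # transform before loading

    def run_elt():
        load(extract_rows(), "orders_raw")  # load the raw data first...
        # ...then transform inside the warehouse, typically with SQL, e.g.
        # CREATE TABLE orders AS SELECT id, CAST(amount AS FLOAT) AS amount FROM orders_raw

    if __name__ == "__main__":
        run_etl()
        run_elt()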
● In ETL, complex transformation logic adds latency before data reaches the destination
● Modern hardware, especially cloud warehouses, is better equipped to run transformations
● Loading raw data first is efficient and reduces time to load
● Cost-effective in the cloud; fewer components required
Moving from traditional ETL to ELT
● Query the source DB and keep an offset (ID or updated timestamp); see the sketch after this slide
● Database change logs (e.g. MySQL binlogs, MongoDB oplogs)
Replication Modes
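
A hedged sketch of the first replication mode (query the source and keep an offset); sqlite3 stands in for the source database here, and the table and column names are assumptions.

    # Incremental pull: fetch only rows changed since the last stored offset.
    import sqlite3

    def pull_incremental(conn, last_updated_at):
        """Return changed rows and the new offset (max updated_at seen)."""
        cur = conn.execute(
            "SELECT id, name, updated_at FROM users "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_updated_at,),
        )
        rows = cur.fetchall()
        new_offset = rows[-1][2] if rows else last_updated_at
        return rows, new_offset

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (id INTEGER, name TEXT, updated_at TEXT)")
        conn.executemany(
            "INSERT INTO users VALUES (?, ?, ?)",
            [(1, "a", "2020-01-01"), (2, "b", "2020-01-02")],
        )
        rows, offset = pull_incremental(conn, "2020-01-01")
        print(rows, offset)  # only row 2 is returned; persist `offset` for the next run

Change-log based replication (binlogs, oplogs) avoids the main weakness of this mode: rows that are deleted, or updated without touching the offset column, are never picked up.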
● New fields can be added to a source at any point in time
● Character lengths of string columns in the source can increase
● Data type incompatibility between source and destination (widening sketch after this slide)
● Varying type casting
● Data loss during loads: power failure, server failure, code bugs, etc.
Challenges
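
One hedged way to cope with the data-type incompatibility above is to widen the destination column rather than fail the load; the promotion order below is an assumption for illustration, not how any specific warehouse or product resolves conflicts.

    # Pick a destination type wide enough to hold both the existing column type
    # and the incoming data's type (widen, never narrow).
    WIDENING_ORDER = ["BOOLEAN", "BIGINT", "DOUBLE", "VARCHAR"]  # assumed lattice

    def widen(dest_type: str, incoming_type: str) -> str:
        """Return the wider of two types; fall back to VARCHAR on conflict."""
        if dest_type == incoming_type:
            return dest_type
        try:
            return max(dest_type, incoming_type, key=WIDENING_ORDER.index)
        except ValueError:  # unknown type name -> safest common ground
            return "VARCHAR"

    if __name__ == "__main__":
        print(widen("BIGINT", "DOUBLE"))   # DOUBLE
        print(widen("DOUBLE", "VARCHAR"))  # VARCHAR
        print(widen("BIGINT", "BIGINT"))   # BIGINT

The same idea handles growing string lengths: alter the destination column to the larger length before loading the batch.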
● Schema detection cannot be done upfront
● Different documents in a single collection can have a different
set of fields
● Different documents in a single collection can have
incompatible field data types
● Nested objects and arrays with a dynamic structure (flattening sketch after this slide)
Additional Challenges with
NoSQL
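
A hedged sketch of one way to deal with the points above: sample incoming documents, flatten nested objects, and record every type observed per field so the destination schema can be derived (or widened) from the sample. The document shapes are made up for illustration.

    # Infer a rough schema from a sample of heterogeneous documents.
    from collections import defaultdict

    def flatten(doc, prefix=""):
        """Yield (dotted_path, type_name) pairs, recursing into nested objects."""
        for key, value in doc.items():
            path = f"{prefix}{key}"
            if isinstance(value, dict):
                yield from flatten(value, prefix=f"{path}.")
            else:
                yield path, type(value).__name__

    def infer_schema(documents):
        schema = defaultdict(set)
        for doc in documents:
            for path, type_name in flatten(doc):
                schema[path].add(type_name)
        return dict(schema)

    if __name__ == "__main__":
        docs = [
            {"_id": 1, "price": 10, "meta": {"tags": ["a"]}},
            {"_id": "2", "price": 10.5},  # _id and price types differ from the first doc
        ]
        print(infer_schema(docs))
        # e.g. {'_id': {'int', 'str'}, 'price': {'int', 'float'}, 'meta.tags': {'list'}}

Fields that end up with more than one observed type then need a widening rule like the one sketched for the previous slide.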
● Transformations
● Security (filtering, hashing) - masking sketch after this slide
● Replay Mechanism
● Integrity and Anomaly Detection
● Monitoring and Alerts for failures
● Activity Log
Effective Implementations
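
For the security item, a minimal sketch of masking sensitive columns before they are loaded; the field list and salt handling are illustrative assumptions, not a recommendation of a specific scheme.

    # Hash PII fields before loading so raw values never reach the destination.
    import hashlib

    SENSITIVE_FIELDS = {"email", "phone"}  # assumed list of fields to mask
    SALT = b"replace-with-a-secret-salt"   # in practice, read from a secret store

    def mask(record: dict) -> dict:
        masked = dict(record)
        for field in SENSITIVE_FIELDS:
            if field in masked:
                digest = hashlib.sha256(SALT + str(masked[field]).encode()).hexdigest()
                masked[field] = digest
        return masked

    if __name__ == "__main__":
        print(mask({"id": 7, "email": "user@example.com", "plan": "pro"}))

Because the hash is deterministic, the masked column can still be used for joins and uniqueness checks without exposing the original value.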
● "How to beat the CAP theorem" by Nathan Marz
● Different layers for stream and batch processing (serving-layer sketch below)
● Need to manage two different layers of the system
Lambda Architecture
Lambda Architecture (diagram)
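
A hedged sketch of what the two layers mean at query time: the serving layer merges a precomputed batch view with the speed layer's recent results. The data shapes are assumptions for illustration.

    # Serving-layer merge: the batch view covers data up to the last batch run,
    # the speed (stream) view covers everything since; queries combine both.
    def merge_counts(batch_view: dict, speed_view: dict) -> dict:
        merged = dict(batch_view)
        for key, count in speed_view.items():
            merged[key] = merged.get(key, 0) + count
        return merged

    if __name__ == "__main__":
        batch_view = {"page_a": 1000, "page_b": 250}  # recomputed by the batch layer
        speed_view = {"page_a": 12, "page_c": 3}      # counted since the last batch run
        print(merge_counts(batch_view, speed_view))
        # {'page_a': 1012, 'page_b': 250, 'page_c': 3}

Keeping the batch and stream code paths consistent is the operational cost the Kappa Architecture on the next slide tries to remove.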
● "Questioning the Lambda Architecture" by Jay Kreps
● A single stream-processing layer, scaled through parallelism
● Set the Kafka retention policy to keep enough history for replay
● Reprocess the retained log into a separate table
● Switch to the new table when reprocessing is done and delete the old one (reprocessing sketch below)
Kappa Architecture
Kappa Architecture (diagram)
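
A hedged sketch of the reprocessing steps listed above; the in-memory log and tables stand in for a retained Kafka topic and the serving store, and process_v2 is a hypothetical new version of the job.

    # Kappa-style reprocessing: replay the retained log through the new job
    # version into a fresh table, then switch readers over and drop the old table.
    def process_v1(event):  # current job logic
        return event["value"]

    def process_v2(event):  # new job logic we want to backfill with
        return event["value"] * 2

    def reprocess(log, process, tables, new_table):
        tables[new_table] = [process(e) for e in log]  # replay from offset 0

    if __name__ == "__main__":
        log = [{"value": 1}, {"value": 2}]  # stands in for a retained Kafka topic
        tables = {"results_v1": [process_v1(e) for e in log]}

        reprocess(log, process_v2, tables, "results_v2")  # build the new table
        serving_table = "results_v2"                      # switch readers to it
        del tables["results_v1"]                          # then delete the old table
        print(serving_table, tables[serving_table])

This only works if the Kafka retention policy keeps enough history to rebuild the table, which is why the retention setting appears as its own bullet above.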
Questions?
Thank You
Manish Singh, Hevo
https://linkedin.com/in/manishsingh123/

Editor's Notes

  • #5 https://youtu.be/YzAIjEQ75_c?t=6892 Explain Kafka SQL
  • #8 Yahoo's Hadoop clusters sorted 1 TB of data in 209 seconds; a petabyte sort using Spark took 4 hours
  • #16 Lambda - 11th Greek letter
  • #18 Kappa - 10th Greek letter