Building a Real-Time Data Pipeline Using Spark Streaming
Challenges and solutions
● Data Architect @ Intuit
● Built a real-time data pipeline
● Deployed it in production
○ Before record
○ After record
○ Fragment number
○ Sequence number
○ Table name
○ Shard ID
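The fields above suggest a change-data-capture (CDC) event. A minimal sketch as a record type; the field names and types are assumptions based on the list, not an actual connector schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    """One CDC event; field names are illustrative."""
    table_name: str
    shard_id: int
    frag_number: int        # fragment number within the change
    seq_number: int         # monotonically increasing per shard
    before: Optional[dict]  # row image before the change (None for inserts)
    after: Optional[dict]   # row image after the change (None for deletes)

# An insert into "users": no before-image, only an after-image.
event = ChangeEvent("users", 3, 0, 1042, None, {"id": 7, "name": "Ada"})
```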
Out-of-sequence events
● Will I be able to detect them?
● How do I handle them?
○ Use a single-partition Kafka topic
○ Use a multi-partition topic with hash partitioning on the primary key
○ Read the existing record before writing
○ Use an EVAT data model with change history
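The hash-partitioning option can be sketched as follows: every event for a given primary key is routed to the same partition, so Kafka's per-partition ordering guarantee becomes a per-key ordering guarantee. A minimal sketch (not Kafka's built-in partitioner, just the idea):

```python
import zlib

def partition_for(primary_key: str, num_partitions: int) -> int:
    # Stable hash: the same key always maps to the same partition,
    # so all updates to one row stay in order within that partition.
    return zlib.crc32(primary_key.encode("utf-8")) % num_partitions

# All updates to row 42 of the users table land on one partition.
p1 = partition_for("users:42", 8)
p2 = partition_for("users:42", 8)
assert p1 == p2
```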
● Can we allow delayed events?
● Embrace eventual consistency
○ Eventual is ok
○ Never is not ok
Maintain state for only 5 minutes.
Is that an option?
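A 5-minute state TTL can be sketched as follows. Any event for a key whose state was already evicted behaves like a late/out-of-sequence event the pipeline can no longer match; this is a simple in-memory sketch, not the Spark state-store API:

```python
import time
from typing import Dict, Optional, Tuple

STATE_TTL_SECONDS = 5 * 60  # keep per-key state for only 5 minutes

state: Dict[str, Tuple[float, dict]] = {}  # key -> (last_seen, value)

def update_state(key: str, value: dict, now: Optional[float] = None) -> None:
    now = time.time() if now is None else now
    state[key] = (now, value)
    # Evict anything older than the TTL; an event arriving for an
    # evicted key later is effectively a late event we cannot match.
    for k in [k for k, (ts, _) in state.items() if now - ts > STATE_TTL_SECONDS]:
        del state[k]
```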
Spark Streaming: throughput vs. latency
Especially relevant when updating a remote data store
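The trade-off against a remote store comes down to batching: grouping writes amortizes network round trips, raising throughput at the cost of per-record latency. A minimal sketch of the batching idea (the `write_batch` callback stands in for a hypothetical remote-store client):

```python
from typing import Callable, List

def flush_in_batches(records: List[dict],
                     write_batch: Callable[[List[dict]], None],
                     batch_size: int = 500) -> int:
    """Write to the remote store in batches: fewer round trips
    (higher throughput) at the cost of per-record latency."""
    batches = 0
    for i in range(0, len(records), batch_size):
        write_batch(records[i:i + batch_size])
        batches += 1
    return batches
```

With 1,200 records and a batch size of 500, this issues 3 round trips instead of 1,200.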
Schema evolves over time
The schema at time t1 differs from the schema at time t2
Does downstream processing fail?
Use a schema registry that supports versioning
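The versioning idea can be sketched as follows: each event is written against a specific schema version, and a consumer resolves the schema the event was written with instead of assuming the latest. This is an illustrative toy registry, not the Confluent Schema Registry API:

```python
class SchemaRegistry:
    """Toy versioned schema registry (illustrative only)."""

    def __init__(self) -> None:
        self._versions: dict = {}  # subject -> list of schemas

    def register(self, subject: str, schema: dict) -> int:
        self._versions.setdefault(subject, []).append(schema)
        return len(self._versions[subject])  # versions start at 1

    def get(self, subject: str, version: int) -> dict:
        return self._versions[subject][version - 1]

registry = SchemaRegistry()
v1 = registry.register("orders", {"fields": ["id", "amount"]})
v2 = registry.register("orders", {"fields": ["id", "amount", "currency"]})
# A consumer reading an old event resolves the schema it was written with,
# so downstream processing keeps working across the evolution.
old_schema = registry.get("orders", v1)
```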
When you go live
● It’s essential to bootstrap your system
● We built a bootstrap connector
● Due to the huge data load, it takes a few minutes to hours
● During the bootstrap, the DB state might still be changing
So, does it cause data loss?
Enable CDC before you bootstrap
● Duplicates are okay, but data loss is not
● Ensure an at-least-once guarantee
It’s also good to support selective bootstrap
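"Duplicates are okay, data loss is not" pairs at-least-once delivery with an idempotent sink: if the same event arrives twice (from a retry, the bootstrap/CDC overlap, or a replay), applying it a second time must be a no-op. A minimal sketch, deduplicating on the event's sequence number:

```python
from typing import Iterable, List, Set

def apply_idempotent(events: Iterable[dict],
                     applied_seqs: Set[int],
                     sink: List[dict]) -> None:
    """At-least-once delivery means the same event may arrive twice;
    deduplicating on sequence number makes re-delivery harmless."""
    for event in events:
        if event["seq"] in applied_seqs:
            continue  # duplicate from a retry or replay: skip
        applied_seqs.add(event["seq"])
        sink.append(event)
```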
Published corrupted data for the past N hours
● A defect in your code
● Some system failure
You can fix the problem and push the fix.
But will that fix the data retrospectively?
Answer: build replay
● Build replay at every stage of the pipeline
● If not, at least at the very first stage
● Now, how do you build replay?
○ Checkpoint (topic, partition & offset)
○ Restart the pipeline from a given offset
● Spark checkpoints the entire DAG (as a binary)
○ Can you tell up to which offset it has processed?
○ To replay, can you set the offset to an older value?
● Will you be able to upgrade/reconfigure your Spark app easily?
● Also, it auto-acks
Don’t rely on Spark checkpointing; build your own
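A custom checkpoint can be as simple as persisting (topic, partition, next offset) after each processed batch, and seeking to the stored offsets on restart. Replay then means overwriting the stored offset with an older value before restarting. A minimal sketch (the file location is illustrative; production would use a durable store):

```python
import json
import os
import tempfile

# Illustrative location; a real pipeline would use a durable store.
CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "offsets.json")

def save_offsets(offsets: dict) -> None:
    """Persist {(topic, partition): next_offset} after each batch."""
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump({f"{t}:{p}": o for (t, p), o in offsets.items()}, f)

def load_offsets() -> dict:
    """On restart, resume from the stored offsets. To replay,
    rewrite the checkpoint with an older offset first."""
    with open(CHECKPOINT_PATH) as f:
        raw = json.load(f)
    out = {}
    for key, offset in raw.items():
        topic, partition = key.rsplit(":", 1)
        out[(topic, int(partition))] = offset
    return out

save_offsets({("events", 0): 1042})
assert load_offsets() == {("events", 0): 1042}
```

Because the offsets are plain data you own, rewinding them for a replay or upgrading the Spark app does not invalidate the checkpoint, unlike Spark's binary DAG checkpoint.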
All Kafka brokers went down. Then what?
● We usually restart them one by one
● We noticed data loss on some topics
Does Kafka lose data?
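Whether Kafka loses acknowledged data in this scenario largely comes down to durability settings. The keys below are real Kafka broker/producer configs; the values are an illustrative durability-first sketch, and the right numbers depend on your cluster:

```properties
# Broker: never elect an out-of-sync replica as leader; this is a common
# cause of losing acknowledged messages when brokers restart one by one.
unclean.leader.election.enable=false

# Broker: replicate each partition, and require writes to reach
# at least two in-sync replicas before they count as committed.
default.replication.factor=3
min.insync.replicas=2

# Producer: wait for all in-sync replicas to acknowledge each write.
acks=all
```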
Diagnosing data issues
● Data loss
● Data corruption
● SLA miss
How do you quickly diagnose the issue?
Diagnosing data issues quickly
● Need a mechanism to track each event uniquely end to end.
● Log aggregation
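Tracking each event end to end can be sketched as follows: stamp a unique trace ID onto every event at the very first stage and log it at every subsequent stage; log aggregation then lets you reconstruct one event's full path by searching for its ID. The stage names here are hypothetical:

```python
import logging
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def ingest(payload: dict) -> dict:
    """First stage: stamp a unique trace ID onto the event."""
    event = {"trace_id": str(uuid.uuid4()), **payload}
    log.info("stage=ingest trace_id=%s", event["trace_id"])
    return event

def transform(event: dict) -> dict:
    # Every stage logs the same trace_id, so aggregated logs
    # reconstruct the event's full path through the pipeline.
    log.info("stage=transform trace_id=%s", event["trace_id"])
    return event
```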
Batch vs. streaming
● In general, when do you choose streaming?
○ Time-critical data
○ Quick decisions
● For a lot of use cases, 30-minute batch processing works fine
● Run both batch and real-time streaming on the same data