SlideShare a Scribd company logo
Building real time data
pipeline
Challenges and solutions
● Data Architect @ Intuit
● Built a real time data pipeline
● Deployed the same in production
Email: Veeramani_Moorthy@intuit.com
Gmail: veeru.moorthy
Linkedin: https://www.linkedin.com/in/veeramani-moorthy-0ab4a72/
● Business Goal
● NRT pipeline Architecture
● Challenges involved & solution
● Metrics & Monitoring Architecture
● Challenges involved & solution
● Q & A
● Build a low latency (1 min SLA) data pipe which can listen to database
changes, transform & publish the final outcome to Salesforce.
● Zero data loss
● Ordering guarantee
Building real time Data Pipeline using Spark Streaming
Technologies used
● Confluent 2.0.1
○ Kafka
○ Schema Registry
○ Kafka connect
○ Zoo keeper
● Spark Streaming 1.6.1
● Datomic 0.9
● DSE 4.2
CDC Schema
● Payload
○ Before record
○ After record
● Header
○ Frag number
○ Seq number
○ Table name
○ Shard id
Out of sequence events
● Will I be able to detect it?
● How do I handle it?
○ Single partition kafka topic
○ Multi partition w/ hash partition on PK
○ Read first, before writing
○ Go with EVAT data model w/ change history
Late Arrival
● Can we allow delayed events?
● Embrace eventual consistency
○ Eventual is ok
○ Never is not ok
Will maintain the state only for 5 mins.
Is that an option?
Spark streaming (throughput vs latency)
Especially in the context of updating a remote data store
● Foreach
● Mapreduce
Schema evolves over time
At time t1t2
Does downstream processing fail?
Use Schema registry which supports versioning
Kafka
Topic
Schema
Registry
When you go live
● It’s essential to bootstrap your system
● Built a bootstrap connector
● Due to huge data load, It takes few mins/hous
● During bootstrap, DB state might be getting changed
So, does it cause data loss?
Enable CDC, before you bootstrap
● Duplicates are okay, but data loss is not okay
● Ensure at least once guarantee
Good to support selective bootstrap
Published corrupted data for past N hour
● Defect in your code
● Mis-configuration
● Some system failure
You can fix the problem & push the fix.
But, will it fix the data retrospectively?
Answer: build replay
● Build replay at every stage of the pipeline
● If not, at least at the very first stage
● Now, how do you build replay?
○ Checkpoint (Topic, partition & offset)
○ Traceability
○ Re-start the pipe from given offset
Spark streaming: checkpointing
Pitfalls
● Spark checkpoints entire DAG (binary)
○ Till which offset it has processed?
○ To replay, Can you set offset to some older value?
● Will you be able to upgrade/re-configure your spark app easily?
● Also, it does auto-ack
Don’t rely on spark checkpointing, build your own
All kafka brokers went down, then?
● We usually re-start them one by one
● Noticed data loss at some topics
Does Kafka lose data?
Kafka broker setup
Kafka broker - Failover scenario
So, if all kafka brokers goes down
Re-start them in the reverse order of failures
Is it good enough?
What if followers are lagging behind?
● Again, this can cause data loss
● Min.insync.replica config to rescue
Kafka connect setup
● Standalone mode
● Distributed mode
Diagnosing data issues
● Data loss
● Data corruption
● SLA miss
How do you quickly diagnose the issue?
Diagnosing data issues quickly
● Need a mechanism to track each event uniquely end to end.
● Log aggregation
Batch vs Streaming
● In general, when do you choose to go for streaming?
○ Time critical data
○ Quick decision
● Lot of use cases: 30 mins batch processing will do good
● Both batch & real time streaming on same data
Batch & Streaming
Metrics & Monitoring Architecture
CDC
Connector
Reconciler
Transforme
r
JMS
Connector
CDC
EBS
Consumer
Audit events
Audit
Streaming
Job
SLA computation
● Source DB timestamp
● Stage timestamp
● SLA = stage TS – source TS
Use NTP to sync to all nodes
Are these the only challenges?
Questions?

More Related Content

What's hot

Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
datamantra
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
datamantra
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
HostedbyConfluent
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
datamantra
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
confluent
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
datamantra
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
datamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
datamantra
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
DB Tsai
 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
HostedbyConfluent
 
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
confluent
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Data Con LA
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda Architecture
Yaroslav Tkachenko
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
Joey Bolduc-Gilbert
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
Piotr Czarnas
 
Ingestion file copy using apex
Ingestion   file copy using apexIngestion   file copy using apex
Ingestion file copy using apex
Apache Apex
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
datamantra
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
datamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
datamantra
 

What's hot (20)

Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 API
 
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life ExampleKafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
Kafka Summit NYC 2017 Introduction to Kafka Streams with a Real-life Example
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Spark on yarn
Spark on yarnSpark on yarn
Spark on yarn
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, ShopifyIt's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
It's Time To Stop Using Lambda Architecture | Yaroslav Tkachenko, Shopify
 
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
What's the time? ...and why? (Mattias Sax, Confluent) Kafka Summit SF 2019
 
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
 
It's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda ArchitectureIt's Time To Stop Using Lambda Architecture
It's Time To Stop Using Lambda Architecture
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Kappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.ioKappa Architecture on Apache Kafka and Querona: datamass.io
Kappa Architecture on Apache Kafka and Querona: datamass.io
 
Ingestion file copy using apex
Ingestion   file copy using apexIngestion   file copy using apex
Ingestion file copy using apex
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 

Similar to Building real time Data Pipeline using Spark Streaming

Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
Demi Ben-Ari
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
DataWorks Summit
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
Codemotion
 
Streaming datasets for personalization
Streaming datasets for personalizationStreaming datasets for personalization
Streaming datasets for personalization
Shriya Arora
 
Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®
confluent
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
LibbySchulze
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
Codemotion Tel Aviv
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
Krivoy Rog IT Community
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
Shuyi Chen
 
Debugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan KumarDebugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan Kumar
Shubham Tagra
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at Netflix
Databricks
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
Flink Forward
 
Server fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil AhujaServer fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil Ahuja
camunda services GmbH
 
Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Activity feeds (and more) at mate1
Activity feeds (and more) at mate1
Hisham Mardam-Bey
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 

Similar to Building real time Data Pipeline using Spark Streaming (20)

Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
 
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State In Apach...
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
Streaming datasets for personalization
Streaming datasets for personalizationStreaming datasets for personalization
Streaming datasets for personalization
 
Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®Tips & Tricks for Apache Kafka®
Tips & Tricks for Apache Kafka®
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
 
Debugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan KumarDebugging data pipelines @OLA by Karan Kumar
Debugging data pipelines @OLA by Karan Kumar
 
Migrating to Apache Spark at Netflix
Migrating to Apache Spark at NetflixMigrating to Apache Spark at Netflix
Migrating to Apache Spark at Netflix
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Server fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil AhujaServer fleet management using Camunda by Akhil Ahuja
Server fleet management using Camunda by Akhil Ahuja
 
Activity feeds (and more) at mate1
Activity feeds (and more) at mate1Activity feeds (and more) at mate1
Activity feeds (and more) at mate1
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 

More from datamantra

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
datamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
datamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
datamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
datamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
datamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
datamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
datamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
datamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
datamantra
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
datamantra
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
datamantra
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
datamantra
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
datamantra
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
datamantra
 

More from datamantra (19)

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 

Recently uploaded

Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
janvikumar4133
 
PHENOMENOLOGY and Interpretive phenomenological analysis
PHENOMENOLOGY and Interpretive phenomenological analysisPHENOMENOLOGY and Interpretive phenomenological analysis
PHENOMENOLOGY and Interpretive phenomenological analysis
CharmoliApumKhrime
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
gargtinna79
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
ASISHSABAT3
 
Universidad Camilo José Cela degree offer diploma Transcript
Universidad Camilo José Cela  degree offer diploma TranscriptUniversidad Camilo José Cela  degree offer diploma Transcript
Universidad Camilo José Cela degree offer diploma Transcript
taqyea
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
TARIKU ENDALE
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
LINAT
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
sapna sharmap11
 
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
dizzycaye
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
harendmgr
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
sharonblush
 
International Journal of Fuzzy Logic Systems (IJFLS)
International Journal of Fuzzy Logic Systems (IJFLS)International Journal of Fuzzy Logic Systems (IJFLS)
International Journal of Fuzzy Logic Systems (IJFLS)
ijfls
 
Ahrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptxAhrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptx
tylermmo95
 
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured DataFine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
kevig
 
Machine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentationMachine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentation
RahulS66654
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
palanisamyiiiier
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
gargnatasha985
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
Milind Agarwal
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
javier ramirez
 
Welcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what yourWelcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what your
Virni Arrora
 

Recently uploaded (20)

Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
Beautiful Girls Call 9711199171 9711199171 Provide Best And Top Girl Service ...
 
PHENOMENOLOGY and Interpretive phenomenological analysis
PHENOMENOLOGY and Interpretive phenomenological analysisPHENOMENOLOGY and Interpretive phenomenological analysis
PHENOMENOLOGY and Interpretive phenomenological analysis
 
Seamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send MoneySeamlessly Pay Online, Pay In Stores or Send Money
Seamlessly Pay Online, Pay In Stores or Send Money
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
 
Universidad Camilo José Cela degree offer diploma Transcript
Universidad Camilo José Cela  degree offer diploma TranscriptUniversidad Camilo José Cela  degree offer diploma Transcript
Universidad Camilo José Cela degree offer diploma Transcript
 
Supervised Learning (Data Science).pptx
Supervised Learning  (Data Science).pptxSupervised Learning  (Data Science).pptx
Supervised Learning (Data Science).pptx
 
Willis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdfWillis Tower //Sears Tower- Supertall Building .pdf
Willis Tower //Sears Tower- Supertall Building .pdf
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
 
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
Female Service Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Se...
 
Harendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting PortfolioHarendra Singh, AI Strategy and Consulting Portfolio
Harendra Singh, AI Strategy and Consulting Portfolio
 
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
Best Girls Call Navi Mumbai 9930245274 Provide Best And Top Girl Service And ...
 
International Journal of Fuzzy Logic Systems (IJFLS)
International Journal of Fuzzy Logic Systems (IJFLS)International Journal of Fuzzy Logic Systems (IJFLS)
International Journal of Fuzzy Logic Systems (IJFLS)
 
Ahrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptxAhrefs SEO Report Template for Marketer.pptx
Ahrefs SEO Report Template for Marketer.pptx
 
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured DataFine-Tuning of Small/Medium LLMs for Business QA on Structured Data
Fine-Tuning of Small/Medium LLMs for Business QA on Structured Data
 
Machine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentationMachine learning _new.pptx for a presentation
Machine learning _new.pptx for a presentation
 
transgenders community data in india by govt
transgenders community data in india by govttransgenders community data in india by govt
transgenders community data in india by govt
 
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in CityGirls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
Girls Call Vadodara 000XX00000 Provide Best And Top Girl Service And No1 in City
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
 
Welcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what yourWelcome back to Instagram. Sign in to check out what your
Welcome back to Instagram. Sign in to check out what your
 

Building real time Data Pipeline using Spark Streaming

  • 1. Building real time data pipeline Challenges and solutions
  • 2. ● Data Architect @ Intuit ● Built a real time data pipeline ● Deployed the same in production Email: Veeramani_Moorthy@intuit.com Gmail: veeru.moorthy Linkedin: https://www.linkedin.com/in/veeramani-moorthy-0ab4a72/
  • 3. ● Business Goal ● NRT pipeline Architecture ● Challenges involved & solution ● Metrics & Monitoring Architecture ● Challenges involved & solution ● Q & A
  • 4. ● Build a low latency (1 min SLA) data pipe which can listen to database changes, transform & publish the final outcome to Salesforce. ● Zero data loss ● Ordering guarantee
  • 6. Technologies used ● Confluent 2.0.1 ○ Kafka ○ Schema Registry ○ Kafka connect ○ Zoo keeper ● Spark Streaming 1.6.1 ● Datomic 0.9 ● DSE 4.2
  • 7. CDC Schema ● Payload ○ Before record ○ After record ● Header ○ Frag number ○ Seq number ○ Table name ○ Shard id
  • 8. Out of sequence events ● Will I be able to detect it? ● How do I handle it? ○ Single partition kafka topic ○ Multi partition w/ hash partition on PK ○ Read first, before writing ○ Go with EVAT data model w/ change history
  • 9. Late Arrival ● Can we allow delayed events? ● Embrace eventual consistency ○ Eventual is ok ○ Never is not ok Will maintain the state only for 5 mins. Is that an option?
  • 10. Spark streaming (throughput vs latency) Especially in the context of updating a remote data store ● Foreach ● Mapreduce
  • 11. Schema evolves over time At time t1t2 Does downstream processing fail?
  • 12. Use Schema registry which supports versioning Kafka Topic Schema Registry
  • 13. When you go live ● It’s essential to bootstrap your system ● Built a bootstrap connector ● Due to huge data load, It takes few mins/hous ● During bootstrap, DB state might be getting changed So, does it cause data loss?
  • 14. Enable CDC, before you bootstrap ● Duplicates are okay, but data loss is not okay ● Ensure at least once guarantee Good to support selective bootstrap
  • 15. Published corrupted data for past N hour ● Defect in your code ● Mis-configuration ● Some system failure You can fix the problem & push the fix. But, will it fix the data retrospectively?
  • 16. Answer: build replay ● Build replay at every stage of the pipeline ● If not, at least at the very first stage ● Now, how do you build replay? ○ Checkpoint (Topic, partition & offset) ○ Traceability ○ Re-start the pipe from given offset
  • 18. Pitfalls ● Spark checkpoints entire DAG (binary) ○ Till which offset it has processed? ○ To replay, Can you set offset to some older value? ● Will you be able to upgrade/re-configure your spark app easily? ● Also, it does auto-ack Don’t rely on spark checkpointing, build your own
  • 19. All kafka brokers went down, then? ● We usually re-start them one by one ● Noticed data loss at some topics Does Kafka lose data?
  • 21. Kafka broker - Failover scenario
  • 22. So, if all kafka brokers goes down Re-start them in the reverse order of failures Is it good enough?
  • 23. What if followers are lagging behind? ● Again, this can cause data loss ● Min.insync.replica config to rescue
  • 24. Kafka connect setup ● Standalone mode ● Distributed mode
  • 25. Diagnosing data issues ● Data loss ● Data corruption ● SLA miss How do you quickly diagnose the issue?
  • 26. Diagnosing data issues quickly ● Need a mechanism to track each event uniquely end to end. ● Log aggregation
  • 27. Batch vs Streaming ● In general, when do you choose to go for streaming? ○ Time critical data ○ Quick decision ● Lot of use cases: 30 mins batch processing will do good ● Both batch & real time streaming on same data
  • 29. Metrics & Monitoring Architecture CDC Connector Reconciler Transforme r JMS Connector CDC EBS Consumer Audit events Audit Streaming Job
  • 30. SLA computation ● Source DB timestamp ● Stage timestamp ● SLA = stage TS – source TS
  • 31. Use NTP to sync to all nodes
  • 32. Are these the only challenges?

Editor's Notes

  1. Table, shard, primary key, fragno & seq no
  2. Throughput vs latency
  3. It’s role in schema evolution
  4. https://splunk-cto-prod-search.platform.intuit.net/en-US/app/search/kabini_ca_audit_dashboard?earliest=0&latest=