Spark stream - Kafka - the right way
Jaquet - Fyber
Dori Waldman - Big Data Lead
Eran Shemesh - Big Data Dev
What we want to Achieve:
■ Save 5 TB of raw streaming data in S3
■ Push aggregated data to a database
■ Enrich and clean data before pushing it to Druid
■ Handle late events, which can arrive as much as a 1-day window late
■ Convert data to Parquet format
■ Customers use the dashboard on a daily basis
Pipeline (Batch/Lake)
Spark stream from JSON to Parquet on S3
Spark batch to clean cardinality, pre-aggregate, and enrich data (K8s)
Jaquet stream challenges
■ Save events to storage (S3) at a certain granularity that also handles late events
■ Recover from failures/crashes (exactly once: no duplication, no data loss)
■ Convert data to Parquet format
■ Control data schema changes
■ Compression
It's not a new problem ...

Vendor | JSON → Parquet | Exactly Once | Comment
Secor (Pinterest) | Needs to convert JSON to Protobuf | Sometimes we lost data | Simple Kafka consumer cluster
S3-connect (Kafka) | Qubole's connector needs to convert JSON to Avro | |
Spark (Fyber) | | | ~80 lines of code, full control
Implementation
Spark Streaming
■ JSON to Parquet - a given
■ Data is read and written in micro batches
■ Data can be read through a schema
■ Web UI
So what’s the problem?
■ In Kafka, it’s the consumer’s responsibility to say which messages it wants to consume
■ What happens when the stream fails?
■ After consuming the data, when do I save my Kafka partition offsets for the next batch?
○ Saving the offsets before writing to S3 can cause skipped events
○ Saving the offsets after writing to S3 can cause duplication
○ Saving the offsets and writing the data to S3 cannot happen atomically (in one transaction)

Example micro batch:
Partition 1, reading offsets [100,250)
Partition 2, reading offsets [150,200)
Partition 3, reading offsets [100,220)
Partition 4, reading offsets [130,300)
Writing each event exactly once - Option A
Write the offsets to RDBMS

Same micro batch:
Partition 1, reading offsets [100,250)
Partition 2, reading offsets [150,200)
Partition 3, reading offsets [100,220)
Partition 4, reading offsets [130,300)

Step 1: write each partition’s events to a Parquet file in S3:
s3/.../partition1.parquet
s3/.../partition2.parquet
s3/.../partition3.parquet
s3/.../partition4.parquet

Step 2: write the file names and the ending offsets to the RDBMS:

Partition name | Offset | File name
Partition1 | 250 | s3/.../partition1.parquet
Partition2 | 200 | s3/.../partition2.parquet
Partition3 | 220 | s3/.../partition3.parquet
Partition4 | 300 | s3/.../partition4.parquet
Writing each event exactly once - Option A
Write the offsets to RDBMS
■ If the stream fails while saving the data to S3
○ The stream would restart from the offsets written in the database
○ There may be duplicated data in S3, but the database contains only the ‘clean’ file paths
○ A batch job can remove the ‘dirty’ files
Writing each event exactly once - Option A
Write the offsets to RDBMS
■ So why not?
○ Requires an RDBMS, which adds complexity and increases risk
○ When the stream fails, the data is not reliable until duplications are cleaned by another external job
Writing each event exactly once - Option B
Combine the offsets with the file path

Partition 1, reading offsets [100,250)
Partition 2, reading offsets [150,200)
Partition 3, reading offsets [100,220)
Partition 4, reading offsets [130,300)

■ Every micro batch, calculate the sum of the starting offsets (here 100 + 150 + 100 + 130 = 480)
■ Embed that sum in the destination paths: s3://.../sum_starting_offsets=480/*.parquet
■ After saving the data to S3, commit the ending offsets back to Kafka (see the sketch below)
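
A minimal sketch of how Option B can be wired up with the Kafka direct stream (spark-streaming-kafka-0-10). The topic name, Kafka settings and basePath are illustrative assumptions, not Jaquet’s actual code:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val spark = SparkSession.builder.appName("jaquet-sketch").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
import spark.implicits._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "jaquet",
  // We commit manually, only after a successful S3 write
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

val basePath = "s3://bucket/events" // illustrative

stream.foreachRDD { rdd =>
  // Offset ranges this micro batch read from each Kafka partition
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Deterministic batch id: the sum of the starting offsets
  val sumStartingOffsets = ranges.map(_.fromOffset).sum
  // Same offsets => same path, so a retried batch overwrites its own partial files
  val jsonDs = rdd.map(_.value()).toDS()
  spark.read.json(jsonDs)
    .write.mode("overwrite")
    .parquet(s"$basePath/sum_starting_offsets=$sumStartingOffsets")
  // Only now commit the ending offsets back to Kafka
  stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
}

ssc.start()
ssc.awaitTermination()
```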
Writing each event exactly once - Option B
Combine the offsets with the file path
■ If the stream fails while saving the data to S3
○ The restarted stream reads from the same uncommitted offsets, so it calculates the same sum_starting_offsets
○ The destination folders are therefore the same as in the failed micro batch
○ Writing the data again overwrites the old files in S3, causing no duplications
Writing each event exactly once - Option B
Combine the offsets with the file path
■ Once the data is written, it can be used
■ Simple: involves only Kafka, a Spark application, and S3
■ So we chose to go in this direction
Putting the data into the right place
■ Add the sum of the starting offsets to each event
■ Write the data frame partitioned by the date columns and sum_starting_offsets:
s3://bucket/2019/05/13/19/sum_starting_offsets/
s3://bucket/2019/05/13/20/sum_starting_offsets/
Putting the data into the right place
■ We wanted to use Spark’s partitioning option
■ But when partitioning data with Spark and overwriting the partitions we want to write to, all the existing partitions get removed! (See the sketch below)
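
The pitfall, sketched. Given a DataFrame df of events, with Spark’s default (static) overwrite mode this call first deletes everything under the base path, not just the partitions contained in df; the column names and path are illustrative:

```scala
// Wipes the whole base path before writing, not only df's partitions
df.write
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour", "sum_starting_offsets")
  .parquet("s3://bucket/events")
```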
Putting the data into the right place
■ Using Spark’s Hive implementation could have solved this problem, but it requires a Hive metastore (dynamic partitions); see:
https://stackoverflow.com/questions/38487667/overwrite-specific-partitions-in-spark-dataframe-write-method
https://medium.com/a-muggles-pensieve/writing-into-dynamic-partitions-using-spark-2e2b818a007a
■ This stream application is the backbone of our data flow, so we wanted to keep it as simple as we can
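
For reference, the dynamic-partition route mentioned above looks roughly like this. It is a sketch, not what Jaquet chose; the table name is illustrative:

```scala
// Classic Hive-metastore settings for dynamic partitions
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
// Spark 2.3+ also has a built-in equivalent
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// Overwrite then replaces only the partitions present in df
df.write.mode("overwrite").insertInto("events")
```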
Putting the data into the right place
■ We decided to do this ourselves (sketched below):
○ Scan the data to infer the destination folders in S3 (select distinct on the partitioning columns)
○ Delete those folders
○ Use Spark’s partitioning with ‘Append’ mode
■ Requires more computation time, but is much simpler, with fewer moving parts
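
A sketch of that “delete, then append” flow, assuming S3 is reachable through Hadoop’s FileSystem API; the partition column names are illustrative, not necessarily Jaquet’s:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

val partitionCols = Seq("year", "month", "day", "hour", "sum_starting_offsets")

def writePartitioned(df: DataFrame, basePath: String): Unit = {
  // 1. Infer the destination folders: distinct values of the partition columns
  val folders = df.select(partitionCols.map(c => df.col(c)): _*).distinct().collect()

  // 2. Delete those folders, so a retried batch does not duplicate data
  val fs = FileSystem.get(new java.net.URI(basePath),
    df.sparkSession.sparkContext.hadoopConfiguration)
  folders.foreach { row =>
    val subPath = partitionCols.indices
      .map(i => s"${partitionCols(i)}=${row.get(i)}").mkString("/")
    fs.delete(new Path(s"$basePath/$subPath"), true) // recursive
  }

  // 3. Append-mode write only touches the partitions present in this batch
  df.write.mode("append").partitionBy(partitionCols: _*).parquet(basePath)
}
```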
Working with schema
■ JSON files are schemaless
■ Any removal of or change to an event attribute on the producer side can break the data pipeline
■ Enforcing a schema on our side when consuming the events helps us avoid this problem
Working with schema

Schema: firstName, lastName, isAlive

With Schema
firstName | lastName | isAlive
Jon | Snow | true
Jon | Snow | null/default

No Schema
firstName | lastName | isAlive | isKing
Jon | Snow | true | null
Jon | Snow | null | false
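
A sketch of enforcing such a schema while parsing the raw JSON; the field names follow the slide’s example, and jsonDs stands for the Dataset[String] of raw events (as in the earlier Option B sketch):

```scala
import org.apache.spark.sql.types._

val eventSchema = StructType(Seq(
  StructField("firstName", StringType),
  StructField("lastName", StringType),
  StructField("isAlive", BooleanType)))

// Unknown producer-side fields (e.g. isKing) are dropped, and missing
// fields come back as null instead of breaking the pipeline
val events = spark.read.schema(eventSchema).json(jsonDs)
```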
Deployment
■ Currently runs on a Hadoop cluster, using YARN
■ In the near future: move to Kubernetes and spot instances, for cost reduction
Deployment
■ How many resources (executors, CPU, memory)?
○ The minimal amount that is enough for the stream
○ Yet enough to close the gap from Kafka after hours of downtime
○ So we tried to find the minimal amount of resources that can sustain 120 percent of our daily max event rate
Deployment
■ We measured the maximum daily rate of events we get from Kafka. Let’s say this number is 1,000 events per second; 120 percent would be 1,200 events per second
■ Let’s say we have 120 Kafka partitions
■ Limit Spark to consume 1200/120 = 10 events per second from each Kafka partition (spark.streaming.kafka.maxRatePerPartition), as sketched below
■ Find the minimal amount of resources (CPU and memory) that keeps up with this limit without accumulating lag on the Kafka topic (trial and error)
■ Use all the machines in the cluster to utilize maximal network I/O throughput
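
As a sketch, the per-partition cap from the example above would be set like this; the numbers are the illustrative ones from the slide:

```scala
// 1200 events/sec target across 120 Kafka partitions
// => cap each partition at 10 events per second
val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "10")
```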
Deployment
■ How many Spark partitions?
○ The common recommendation is ~2x-3x as many Spark partitions as cores, to make the best use of all cores and to reduce unbalanced partitions
○ We still went for a one-to-one ratio, to reduce the number of files in S3
Monitoring and stream auto recovery
■ Recovery
○ Spark can recover from task failures
○ YARN can recover from stream failure
○ We added a watchdog that monitors the stream and restarts it if it’s down
○ Grafana monitors the Kafka lag per topic
○ We added an onBatchCompleted listener as a keep-alive (sketched below)
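
A sketch of such a listener; reportHeartbeat() stands in for whatever hook the monitoring system exposes (an assumption, not Jaquet’s actual API):

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Stub for the monitoring hook
def reportHeartbeat(batchTimeMs: Long): Unit = println(s"alive at $batchTimeMs")

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // One heartbeat per completed micro batch; silence means the stream is stuck
    reportHeartbeat(batch.batchInfo.batchTime.milliseconds)
  }
})
```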
Day after deployment
Works as expected: recovers from crashes and downtime
Airflow
■ Scheduler
■ Recovery from failure
■ UI
■ Each task monitors itself and auto-fixes if needed, including sending atomic alerts per DAG (since Airflow 1.10)
Try Jaquet
■ https://medium.com/@eranshemesh/jaquet-saving-your-mass-of-events-b9d1d5f16c5
■ https://github.com/SponsorPay/jaquet
Thank You
