Spark stream - Kafka - the right way
Jaquet - Fyber
Dori Waldman - Big Data Lead
Eran Shemesh - Big Data Dev
What we want to Achieve:
■ Save 5 TB of raw streaming data in S3
■ Push aggregated data to a database
■ Enrich and clean data before pushing it to Druid
■ Handle late events, which can arrive as much as a 1-day window late
■ Convert data to Parquet format
■ Customers use the dashboard on a daily basis
Pipeline (Batch/Lake)
Spark stream from JSON to Parquet on S3
Spark batch to clean cardinality, pre-aggregate, and enrich data (K8s)
Jaquet stream challenges
■ Save events to storage (S3) at a certain granularity that also handles late events
■ Recover from failures/crashes (exactly once: no duplication, no data loss)
■ Convert data to Parquet format
■ Control data schema changes
■ Compression
It's not a new problem ...

Vendor | JSON → Parquet | Exactly Once | Comment
Secor (Pinterest) | Needs to convert JSON to Protobuf | Sometimes we lost data | Simple Kafka consumer cluster
S3-connect (Kafka) | Qubole's connector needs to convert JSON to Avro | |
Spark (Fyber) | | | ~80 lines of code, full control
Implementation
Spark Streaming
■ JSON to Parquet - a given
■ Data is read and written in micro batches
■ Data can be read through a schema
■ Web UI
So what’s the problem?
■ In Kafka, it’s the consumer’s responsibility to say which messages it wants to consume
■ What happens when the stream fails?
■ After consuming the data, when do I save my Kafka partition offsets for the next batch?
○ Saving the offsets before writing to S3 can cause skipped events
○ Saving the offsets after writing to S3 can cause duplication
○ Saving the offsets and writing the data to S3 cannot happen atomically (in one transaction)

Example micro batch:
Partition 1, reading offsets [100,250)
Partition 2, reading offsets [150,200)
Partition 3, reading offsets [100,220)
Partition 4, reading offsets [130,300)
Writing each event exactly once - Option A
Write the offsets to RDBMS

Same micro batch:
Partition 1, reading offsets [100,250)
Partition 2, reading offsets [150,200)
Partition 3, reading offsets [100,220)
Partition 4, reading offsets [130,300)

Step 1: write each partition’s events to a Parquet file in S3:
s3/.../partition1.parquet
s3/.../partition2.parquet
s3/.../partition3.parquet
s3/.../partition4.parquet

Step 2: write the file names and the ending offsets to the RDBMS:

Partition name | Offset | File name
Partition1 | 250 | s3/.../partition1.parquet
Partition2 | 200 | s3/.../partition2.parquet
Partition3 | 220 | s3/.../partition3.parquet
Partition4 | 300 | s3/.../partition4.parquet
Writing each event exactly once - Option A
Write the offsets to RDBMS
■ If the stream fails while saving the data to S3
○ The stream would restart from the offsets written in the database
○ There may be duplicated data in S3, but the database contains only the ‘clean’ file paths
○ A batch job can remove the ‘dirty’ files
Writing each event exactly once - Option A
Write the offsets to RDBMS
■ So why not?
○ Requires an RDBMS, which adds complexity and increases risk
○ When the stream fails, the data is not reliable until duplications are cleaned by another external job
Writing each event exactly once - Option B
Combine the offsets with the file path

Partition 1, reading offsets [100,250)
Partition 2, reading offsets [150,200)
Partition 3, reading offsets [100,220)
Partition 4, reading offsets [130,300)

■ Every micro batch, calculate the sum of the starting offsets (here 100 + 150 + 100 + 130 = 480)
■ Embed that sum in the destination paths: s3://.../sum_starting_offsets=480/*.parquet
■ After saving the data to S3, commit the ending offsets back to Kafka (see the sketch below)
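
A minimal sketch of how Option B can be wired up with the Kafka direct stream (spark-streaming-kafka-0-10). The topic name, Kafka settings and basePath are illustrative assumptions, not Jaquet’s actual code:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val spark = SparkSession.builder.appName("jaquet-sketch").getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
import spark.implicits._

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "jaquet",
  // We commit manually, only after a successful S3 write
  "enable.auto.commit" -> (false: java.lang.Boolean))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

val basePath = "s3://bucket/events" // illustrative

stream.foreachRDD { rdd =>
  // Offset ranges this micro batch read from each Kafka partition
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // Deterministic batch id: the sum of the starting offsets
  val sumStartingOffsets = ranges.map(_.fromOffset).sum
  // Same offsets => same path, so a retried batch overwrites its own partial files
  val jsonDs = rdd.map(_.value()).toDS()
  spark.read.json(jsonDs)
    .write.mode("overwrite")
    .parquet(s"$basePath/sum_starting_offsets=$sumStartingOffsets")
  // Only now commit the ending offsets back to Kafka
  stream.asInstanceOf[CanCommitOffsets].commitAsync(ranges)
}

ssc.start()
ssc.awaitTermination()
```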
Writing each event exactly once - Option B
Combine the offsets with the file path
■ If the stream fails while saving the data to S3
○ The restarted stream reads from the same uncommitted offsets, so it calculates the same sum_starting_offsets
○ The destination folders are therefore the same as in the failed micro batch
○ Writing the data again overwrites the old files in S3, causing no duplications
Writing each event exactly once - Option B
Combine the offsets with the file path
■ Once the data is written, it can be used
■ Simple: involves only Kafka, a Spark application, and S3
■ So we chose to go in this direction
Putting the data into the right place
■ Add the sum of the starting offsets to each event
■ Write the data frame partitioned by the date columns and sum_starting_offsets:
s3://bucket/2019/05/13/19/sum_starting_offsets/
s3://bucket/2019/05/13/20/sum_starting_offsets/
Putting the data into the right place
■ We wanted to use Spark’s partitioning option
■ But when partitioning data with Spark and overwriting the partitions we want to write to, all the existing partitions get removed! (See the sketch below)
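
The pitfall, sketched. Given a DataFrame df of events, with Spark’s default (static) overwrite mode this call first deletes everything under the base path, not just the partitions contained in df; the column names and path are illustrative:

```scala
// Wipes the whole base path before writing, not only df's partitions
df.write
  .mode("overwrite")
  .partitionBy("year", "month", "day", "hour", "sum_starting_offsets")
  .parquet("s3://bucket/events")
```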
Putting the data into the right place
■ Using Spark’s Hive implementation could have solved this problem, but it requires a Hive metastore (dynamic partitions); see:
https://stackoverflow.com/questions/38487667/overwrite-specific-partitions-in-spark-dataframe-write-method
https://medium.com/a-muggles-pensieve/writing-into-dynamic-partitions-using-spark-2e2b818a007a
■ This stream application is the backbone of our data flow, so we wanted to keep it as simple as we can
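
For reference, the dynamic-partition route mentioned above looks roughly like this. It is a sketch, not what Jaquet chose; the table name is illustrative:

```scala
// Classic Hive-metastore settings for dynamic partitions
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
// Spark 2.3+ also has a built-in equivalent
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// Overwrite then replaces only the partitions present in df
df.write.mode("overwrite").insertInto("events")
```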
Putting the data into the right place
■ We decided to do this ourselves (sketched below):
○ Scan the data to infer the destination folders in S3 (select distinct on the partitioning columns)
○ Delete those folders
○ Use Spark’s partitioning with ‘Append’ mode
■ Requires more computation time, but is much simpler, with fewer moving parts
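
A sketch of that “delete, then append” flow, assuming S3 is reachable through Hadoop’s FileSystem API; the partition column names are illustrative, not necessarily Jaquet’s:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

val partitionCols = Seq("year", "month", "day", "hour", "sum_starting_offsets")

def writePartitioned(df: DataFrame, basePath: String): Unit = {
  // 1. Infer the destination folders: distinct values of the partition columns
  val folders = df.select(partitionCols.map(c => df.col(c)): _*).distinct().collect()

  // 2. Delete those folders, so a retried batch does not duplicate data
  val fs = FileSystem.get(new java.net.URI(basePath),
    df.sparkSession.sparkContext.hadoopConfiguration)
  folders.foreach { row =>
    val subPath = partitionCols.indices
      .map(i => s"${partitionCols(i)}=${row.get(i)}").mkString("/")
    fs.delete(new Path(s"$basePath/$subPath"), true) // recursive
  }

  // 3. Append-mode write only touches the partitions present in this batch
  df.write.mode("append").partitionBy(partitionCols: _*).parquet(basePath)
}
```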
Working with schema
■ JSON files are schemaless
■ Any removal of or change to an event attribute on the producer side can break the data pipeline
■ Enforcing a schema on our side when consuming the events helps us avoid this problem
Working with schema

Schema: firstName, lastName, isAlive

With Schema
firstName | lastName | isAlive
Jon | Snow | true
Jon | Snow | null/default

No Schema
firstName | lastName | isAlive | isKing
Jon | Snow | true | null
Jon | Snow | null | false
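
A sketch of enforcing such a schema while parsing the raw JSON; the field names follow the slide’s example, and jsonDs stands for the Dataset[String] of raw events (as in the earlier Option B sketch):

```scala
import org.apache.spark.sql.types._

val eventSchema = StructType(Seq(
  StructField("firstName", StringType),
  StructField("lastName", StringType),
  StructField("isAlive", BooleanType)))

// Unknown producer-side fields (e.g. isKing) are dropped, and missing
// fields come back as null instead of breaking the pipeline
val events = spark.read.schema(eventSchema).json(jsonDs)
```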
Deployment
■ Currently runs on a Hadoop cluster, using YARN
■ In the near future: move to Kubernetes and spot instances, for cost reduction
Deployment
■ How many resources (executors, CPU, memory)?
○ The minimal amount that is enough for the stream
○ Yet enough to close the gap from Kafka after hours of downtime
○ So we tried to find the minimal amount of resources that can sustain 120 percent of our daily max event rate
Deployment
■ We measured the maximum daily rate of events we get from Kafka. Let’s say this number is 1,000 events per second; 120 percent would be 1,200 events per second
■ Let’s say we have 120 Kafka partitions
■ Limit Spark to consume 1200/120 = 10 events per second from each Kafka partition (spark.streaming.kafka.maxRatePerPartition), as sketched below
■ Find the minimal amount of resources (CPU and memory) that keeps up with this limit without accumulating lag on the Kafka topic (trial and error)
■ Use all the machines in the cluster to utilize maximal network I/O throughput
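
As a sketch, the per-partition cap from the example above would be set like this; the numbers are the illustrative ones from the slide:

```scala
// 1200 events/sec target across 120 Kafka partitions
// => cap each partition at 10 events per second
val conf = new org.apache.spark.SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "10")
```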
Deployment
■ How many Spark partitions?
○ The common recommendation is ~2x-3x as many Spark partitions as cores, to make the best use of all cores and to reduce unbalanced partitions
○ We still went for a one-to-one ratio, to reduce the number of files in S3
Monitoring and stream auto recovery
■ Recovery
○ Spark can recover from task failures
○ YARN can recover from stream failure
○ We added a watchdog that monitors the stream and restarts it if it’s down
○ Grafana monitors the Kafka lag per topic
○ We added an onBatchCompleted listener as a keep-alive (sketched below)
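
A sketch of such a listener; reportHeartbeat() stands in for whatever hook the monitoring system exposes (an assumption, not Jaquet’s actual API):

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Stub for the monitoring hook
def reportHeartbeat(batchTimeMs: Long): Unit = println(s"alive at $batchTimeMs")

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // One heartbeat per completed micro batch; silence means the stream is stuck
    reportHeartbeat(batch.batchInfo.batchTime.milliseconds)
  }
})
```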
Day after deployment
Works as expected: recovers from crashes and downtime
Airflow
■ Scheduler
■ Recovery from failure
■ UI
■ Each task monitors itself and auto-fixes if needed, including sending atomic alerts per DAG (since Airflow 1.10)
Try Jaquet
■ https://medium.com/@eranshemesh/jaquet-saving-your-mass-of-events-b9d1d5f16c5
■ https://github.com/SponsorPay/jaquet
Thank You
