Redis + Apache Spark = Swiss Army Knife meets Kitchen Sink
Yeshwanth Vijayakumar
Sr. Engineering Manager/Architect @ Adobe
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a dataframe backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.



Niche 1: Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue

• Why?
  ▪ Custom queries on top of a table; we load the data once and query it N times
• Why not Structured Streaming?
• Working solution using Redis

Niche 2: Distributed Counters

• Problems with Spark Accumulators
• Utilize Redis Hashes as distributed counters
• Precautions for retries and speculative execution
• Pipelining to improve performance

1. Redis + Apache Spark = Swiss Army Knife meets Kitchen Sink
   Yeshwanth Vijayakumar, Sr. Engineering Manager/Architect @ Adobe
2. Agenda
   • Niche 1: Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
   • Niche 2: Distributed Counters
3. Niche 1: Long Running Spark Batch Job
4. Problem Context
5. Run as many queries as possible in parallel on top of a denormalized dataframe
   • Query 1: foo = 1
   • Query 2: bar.baz > 120
   • Query 3: state in [CA, NY]
   • ... up to Query 1000
   ProfileIds   field1   field1000   eventsArray
   a@a.com      a        x           [e1,2,3]
   b@g.com      b        x           [e1]
   d@d.com      d        y           [e1,2,3]
   z@z.com      z        y           [e1,2,3,5,7]
6. What do we need?
   • Long Running Spark Batch Job
   • Dispatch new jobs by polling a Redis queue
   • We want to parametrize a Spark action repeatedly for interactive results
     ▪ E.g. submit custom queries on top of a table: we load the data once and query it N times (see the sketch after this list)
   • Bringing up a Spark cluster per job has a latency cost
   • Wasted time doing the same initialization actions multiple times
   • Possible multi-tenancy
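A minimal PySpark sketch of the load-once, query-N-times idea; the input path, view name, and predicates are illustrative, not the speaker's actual code:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("continuous-session").getOrCreate()

    # Load and cache the denormalized dataframe once, up front.
    profiles = spark.read.parquet("/data/profiles_denormalized")  # illustrative path
    profiles.createOrReplaceTempView("profiles")
    profiles.cache().count()  # count() materializes the cache

    # Every later request is just a predicate against the cached view.
    def run_query(predicate: str) -> int:
        return spark.sql(f"SELECT count(*) FROM profiles WHERE {predicate}").first()[0]

    print(run_query("foo = 1"))
    print(run_query("state in ('CA', 'NY')"))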
7. Why not Apache Livy et al.?
8. Why not Structured Streaming?
   • Lack of access to the Spark context within the executor context
   • Can't do a Spark action on top of a dataframe that is already loaded in the driver unless you do a join
   • Doing a join is extremely limited
9. Working Solution Summary
   • Blocking POP on Redis inside the driver and use the Command pattern to send queries to a Redis queue
   • Consume the commands and trigger Spark actions using a FAIR scheduler
   • Communicate the status of a job through a microservice/database, or Redis itself!
10. Session Workflow – Spark Continuous Session
    Components: Submit Query API, Spark Driver, Executors 1..N, Fetch Results API
    Submit Query API:
      1. POST /preview
      2. Check if result is in cache
      3. Push <query> into queue
      4. Pop queries till queue is empty [q1, q2, q3, ... q100]
    Fetch Results API:
      1. GET /preview/<previewID>
      2. Fetch counters from Redis
    (diagram: the sample dataframe's partitions are spread across the executors)
11–13. Working Solution – Code View
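The code on these three slides is not captured in the transcript. As a hedged reconstruction of the summary on slide 9, here is a minimal driver loop with redis-py: a blocking pop on a queue, each payload treated as a command, results reported back through Redis. The queue name, wire format, and result key are assumptions:

    import redis
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("continuous-session")
             .config("spark.scheduler.mode", "FAIR")  # FAIR scheduler, per slide 9
             .getOrCreate())

    # Load the denormalized dataframe once (path is illustrative).
    profiles = spark.read.parquet("/data/profiles_denormalized")
    profiles.createOrReplaceTempView("profiles")
    profiles.cache().count()

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    while True:
        # BLPOP blocks until a command arrives; returns (queue_name, payload).
        _, payload = r.blpop("query_queue")
        query_id, predicate = payload.split("|", 1)  # assumed "id|predicate" format
        try:
            n = spark.sql(f"SELECT count(*) FROM profiles WHERE {predicate}").first()[0]
            r.hset("query_results", query_id, n)     # report status via Redis itself
        except Exception as exc:
            r.hset("query_results", query_id, f"ERROR: {exc}")

In practice each popped command would be handed to a thread pool so that several Spark actions run concurrently under the FAIR scheduler, rather than serially as in this sketch.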
14. Niche 2: Distributed Counters
15. What is wrong with Accumulators?
    • Repeated task execution - non-idempotency
      ▪ Task failures and retries
      ▪ Re-using a stage in repeated operations
      ▪ Speculative execution
    • Memory pressure on the driver on collect()
    • Can't access per-partition stats programmatically, AFAIK
16. What is wrong with Accumulators? - Example
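The example slide itself is not captured in the transcript. A minimal sketch of the pitfall, assuming an accumulator updated inside a transformation:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    counter = sc.accumulator(0)

    def tag(x):
        counter.add(1)  # side effect inside a transformation
        return x

    rdd = sc.parallelize(range(1000)).map(tag)

    rdd.count()           # first action: counter is now 1000
    rdd.count()           # rdd was never cached, so map() re-runs
    print(counter.value)  # 2000, double-counted

Spark only guarantees exactly-once accumulator updates for actions; updates made inside transformations can be applied more than once on retries, stage re-use, or speculative execution.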
17–18. Utilize Redis Hashes as distributed counters
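The transcript does not preserve the code on these slides. A sketch of the idea, reusing the profiles dataframe from the earlier sketch; the hash key, field names, and counted column are assumptions:

    import redis

    def count_partition(rows):
        # Aggregate locally first, so we issue one HINCRBY per distinct value
        # rather than one per row.
        local = {}
        for row in rows:
            local[row["state"]] = local.get(row["state"], 0) + 1
        r = redis.Redis(host="localhost", port=6379)
        for field, n in local.items():
            r.hincrby("state_counts", field, n)  # atomic server-side increment

    profiles.foreachPartition(count_partition)

HINCRBY is atomic but not idempotent, so a retried or speculatively executed task would double-count. One guard, per the "precautions" item in the outline, is to mark each partition's flush with SETNX (e.g. keyed on TaskContext.get().partitionId()) and skip the write when the marker already exists.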
19. Excellent Throughput
20. Digging into Redis Pipelining + Spark
    From https://redis.io/topics/pipelining
    (diagram: request/response round trips without pipelining vs. with pipelining)
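Pipelining batches the per-partition flush into a single network round trip instead of one round trip per counter. The same hypothetical count_partition, rewritten with a redis-py pipeline:

    import redis

    def count_partition_pipelined(rows):
        local = {}
        for row in rows:
            local[row["state"]] = local.get(row["state"], 0) + 1
        r = redis.Redis(host="localhost", port=6379)
        pipe = r.pipeline(transaction=False)  # plain pipelining, no MULTI/EXEC
        for field, n in local.items():
            pipe.hincrby("state_counts", field, n)
        pipe.execute()  # every queued command goes out in one round trip

    profiles.foreachPartition(count_partition_pipelined)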
21. Important Config Optimizations: Off-Heap Allocation
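The slide's details are not in the transcript; Spark's standard off-heap memory settings look like this (the size is illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.memory.offHeap.enabled", "true")  # allow off-heap allocation
             .config("spark.memory.offHeap.size", "4g")       # illustrative size
             .getOrCreate())

Moving execution memory off-heap reduces GC pressure on long-running drivers and executors, which matters for a continuous session that never restarts.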
22. Feedback
    Your feedback is important to us. Don't forget to rate and review the sessions.
