Redis + Structured Streaming—A Perfect Combination to Scale-Out Your Continuous Applications
This presentation discusses how to use Redis and Spark Structured Streaming to process streaming data at scale. The solution breaks down into three functional blocks: data ingest using Redis Streams, data processing using Spark Structured Streaming, and data querying using Spark SQL. Redis Streams ingest the streaming click data, Spark Structured Streaming processes it in micro-batches, and Spark SQL queries the processed data stored as Redis hashes. Together they provide a scalable solution to continuously collect, process, and query data streams in real time.
Wi-Fi SSID and password for attendees at the Spark + AI Summit.
The presentation by Roshan Kumar on how Redis and Structured Streaming integrate for scalable continuous applications.
Introduction to collecting and processing real-time data streams, including IoT, user activity, and messages.
Description of the ClickAnalyzer solution's functional blocks: data ingest, processing, and querying click data.
In-depth explanation of data ingestion using Redis Streams, including command examples and benefits.
Overview of Structured Streaming, its fast, scalable processing capabilities, and operational definitions.
Detailed explanation of the data processing steps using Spark Structured Streaming and Redis Streams.
Steps for querying Redis using Spark SQL, including initializing the Spark context, creating tables, and running queries.
Recap of the presentation's key points on the ClickAnalyzer, with an invitation for feedback and questions.
Breaking Up Our Solution into Functional Blocks
Click data
Record all clicks
Count clicks in real-time
Query clicks by assets
1. Data Ingest 2. Data Processing 3. Data Querying
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash → Spark SQL
1. Data Ingest 2. Data Processing 3. Data Querying
The Actual Building Blocks of Our Solution
Click data
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash → Spark SQL
1. Data Ingest 2. Data Processing 3. Data Querying
Data Ingest using Redis Streams
Comparing Redis Streams with Redis Pub/Sub, Lists, and Sorted Sets
Pub/Sub
• Fire and forget
• No persistence
• No lookback queries
Lists
• Tight coupling between producers and consumers
• Persistence for transient data only
• No lookback queries
Sorted Sets
• Data ordering isn’t built-in; producer controls the order
• No maximum limit
• The data structure is not designed to handle data streams
What is Redis Streams?
It is like Pub/Sub, but with persistence.
It is like Lists, but decouples producers and consumers.
It is like Sorted Sets, but asynchronous.
Plus:
• Lifecycle management of streaming data
• Built-in support for timeseries data
• A rich choice of options for consumers to read streaming and static data
• Super fast lookback queries powered by radix trees
• Automatic eviction of data based on the upper limit
Redis Streams Benefits
It enables asynchronous data exchange between producers and consumers, as well as historical range queries.
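These benefits can be sketched with raw Redis commands, in the same style as the xadd command shown later in the deck. This is an illustration, not part of the talk; it assumes a running Redis 5+ server, and image_1001/image_1002 are made-up IDs:

```
# Append a click; * auto-generates a timestamp-based entry ID (built-in time-series support)
XADD clickstream * img image_1001

# Cap the stream at roughly 1,000,000 entries (automatic eviction at the upper limit)
XADD clickstream MAXLEN ~ 1000000 * img image_1002

# Lookback query over the full historical range (powered by radix trees)
XRANGE clickstream - +

# A consumer blocks waiting for new entries: asynchronous producer/consumer exchange
XREAD BLOCK 0 STREAMS clickstream $
```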
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash → Spark SQL
1. Data Ingest 2. Data Processing 3. Data Querying
Data Processing using Spark’s Structured Streaming
“Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.”
Definition
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash
Spark-Redis Library
• Redis Streams as data source
• Redis as data sink
§ Developed using Scala
§ Compatible with Spark 2.3 and higher
§ Supports
• RDD
• DataFrames
• Structured Streaming
Redis Streams as Data Source
1. Connect to the Redis instance
2. Map Redis Stream to Structured Streaming schema
3. Create the query object
4. Run the query
Code Walkthrough: Redis Streams as Data Source
1. Connect to the Redis instance
val spark = SparkSession.builder()
  .appName("redis-df")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count
Code Walkthrough: Redis Streams as Data Source
2. Map Redis Stream to Structured Streaming schema
val spark = SparkSession.builder()
  .appName("redis-df")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count
xadd clickstream * img [image_id]
Code Walkthrough: Redis Streams as Data Source
3. Create the query object
val spark = SparkSession.builder()
  .appName("redis-df")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count
Code Walkthrough: Redis Streams as Data Source
4. Run the query
val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count

val clickWriter: ClickForeachWriter = new ClickForeachWriter("localhost", "6379")

val query = queryByImg.writeStream
  .outputMode("update")
  .foreach(clickWriter)
  .start()

query.awaitTermination()
Custom output sink
Redis as Output Sink
override def process(record: Row) = {
  val img = record.getString(0)
  val count = record.getLong(1)
  if (jedis == null) {
    connect()
  }
  jedis.hset("clicks:" + img, "img", img)
  jedis.hset("clicks:" + img, "count", count.toString)
}
Create a custom class extending ForeachWriter and override its process() method.
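For context, the process() override lives inside the full writer class. A minimal sketch of ClickForeachWriter, not shown verbatim in the slides, assuming the Jedis client and matching the constructor arguments passed to writeStream earlier:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}
import redis.clients.jedis.Jedis

class ClickForeachWriter(host: String, port: String) extends ForeachWriter[Row] {
  var jedis: Jedis = _

  // Connect lazily so each executor opens its own connection
  def connect(): Unit = {
    jedis = new Jedis(host, port.toInt)
  }

  override def open(partitionId: Long, epochId: Long): Boolean = true

  override def process(record: Row): Unit = {
    val img = record.getString(0)
    val count = record.getLong(1)
    if (jedis == null) connect()
    jedis.hset("clicks:" + img, "img", img)
    jedis.hset("clicks:" + img, "count", count.toString)
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (jedis != null) jedis.close()
  }
}
```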
Save as a Hash with the structure:
clicks:[image]
  img   [image]
  count [count]
Example:
clicks:image_1001
  img   image_1001
  count 1029
clicks:image_1002
  img   image_1002
  count 392
...
img          count
image_1001   1029
image_1002   392
...          ...
Table: Clicks
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash → Spark SQL
1. Data Ingest 2. Data Processing 3. Data Querying
Building Blocks of Our Solution
The Spark-Redis Library is used for:
• Redis Streams as data source
• Redis as data sink
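The querying steps summarized earlier (initialize the Spark session, create a table over the clicks:* hashes, run queries) can be sketched as follows. This is an illustration based on the Spark-Redis DataFrame API, reusing the same host/port configuration as the ingest side; the table name and key column match the hash layout above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("redis-sql")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

// Map the Redis hashes stored under clicks:* to a Spark SQL table
spark.sql(
  """CREATE TABLE IF NOT EXISTS clicks
    |USING org.apache.spark.sql.redis
    |OPTIONS (table 'clicks', key.column 'img')
  """.stripMargin)

// Query the click counts collected by the streaming job
spark.sql("SELECT img, `count` FROM clicks ORDER BY `count` DESC").show()
```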