PRESENTED BY
Redis + Spark Structured Streaming:
A Perfect Combination to Scale Out Your Continuous Applications
Dave Nielsen
Redis Labs
Agenda:
How to collect and process data streams in real time at scale
IoT
User Activity
Messages
http://bit.ly/spark-redis
Breaking up Our Solution into Functional Blocks
Click data
Record all clicks · Count clicks in real time · Query clicks by asset
1. Data Ingest · 2. Data Processing · 3. Data Querying
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash → Spark SQL
1. Data Ingest · 2. Data Processing · 3. Data Querying
The Actual Building Blocks of Our Solution
Click data
1. Data Ingest
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash → Spark SQL
1. Data Ingest · 2. Data Processing · 3. Data Querying
Data Ingest using Redis Streams
What is Redis Streams?
Redis Streams in its Simplest Form
Producer → Consumer
Redis Streams can Connect Many Producers and Consumers
Producer 2
Producer m
Producer 1
Producer 3
Consumer 1
Consumer n
Consumer 2
Consumer 3
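The fan-in/fan-out above can be sketched in a few lines. This is a hypothetical in-memory model (not the Redis implementation; `MiniStream`, `xadd`, and `xread` here are simplified stand-ins for the real commands), in Python for a self-contained illustration, showing why one stream can serve many independent consumers:

```python
# Hypothetical in-memory model of a stream with many producers and consumers.
# Unlike a Redis List (where a popped element is gone for everyone), a stream
# is an append-only log: each consumer keeps its own read position, so every
# consumer independently sees every entry.
class MiniStream:
    def __init__(self):
        self.log = []          # append-only list of (id, producer, payload)
        self.next_id = 0

    def xadd(self, producer, payload):
        entry_id = self.next_id
        self.next_id += 1
        self.log.append((entry_id, producer, payload))
        return entry_id

    def xread(self, after_id):
        # XREAD-style: everything after the caller's last-seen id
        return [e for e in self.log if e[0] > after_id]

stream = MiniStream()
for producer in ["producer-1", "producer-2", "producer-3"]:
    for i in range(2):
        stream.xadd(producer, f"click-{i}")

# Two independent consumers, both reading from the beginning:
seen_by_consumer_1 = stream.xread(-1)
seen_by_consumer_2 = stream.xread(-1)
```

Both consumers see all six entries from all three producers, without removing them for each other.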
Comparing Redis Streams with Redis Pub/Sub, Lists, Sorted Sets
Pub/Sub
• Fire and forget
• No persistence
• No lookback queries
Lists
• Tight coupling between
producers and consumers
• Persistence for transient
data only
• No lookback queries
Sorted Sets
• Data ordering isn’t built-in;
producer controls the order
• No maximum limit
• The data structure is not
designed to handle data
streams
What is Redis Streams?
Pub/Sub Lists Sorted Sets
It is like Pub/Sub, but
with persistence
It is like Lists, but decouples
producers and consumers
It is like Sorted Sets, but
asynchronous
+
• Lifecycle management of streaming data
• Built-in support for timeseries data
• A rich set of options for consumers to read streaming and static data
• Super-fast lookback queries powered by radix trees
• Automatic eviction of data based on an upper limit
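The last bullet can be illustrated with a sketch of count-based trimming. `CappedStream` is a hypothetical stand-in for what `XADD clickstream MAXLEN 1000 ...` does in real Redis (cap the stream by entry count rather than by a time window); Python is used for a self-contained illustration:

```python
# Sketch of MAXLEN-style eviction: when the stream grows past its cap, the
# oldest entries are trimmed automatically.
class CappedStream:
    def __init__(self, maxlen):
        self.maxlen = maxlen
        self.log = []

    def xadd(self, payload):
        self.log.append(payload)
        if len(self.log) > self.maxlen:
            self.log = self.log[-self.maxlen:]   # evict the oldest entries

    def xlen(self):
        return len(self.log)

s = CappedStream(maxlen=3)
for i in range(1, 6):
    s.xadd(f"click-{i}")
```

After five adds against a cap of three, only the newest three entries remain.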
Redis Streams Benefits
Analytics
Data Backup
Consumers
Producer
Messaging
Enables asynchronous data exchange between producers and consumers, plus historical range queries
Redis Streams Benefits
Producer
Image Processor
Arrival Rate: 500/sec
Consumption Rate: 500/sec
Image Processor
Image Processor
Image Processor
Image Processor
Redis Stream
With consumer groups, you can scale out and avoid backlogs
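The scale-out idea can be sketched as follows. This models XREADGROUP semantics in spirit only: each entry is delivered to exactly one consumer in the group. The round-robin scheduling here is a simplification (real Redis hands an entry to whichever consumer calls XREADGROUP next), and the names are hypothetical; Python for a self-contained illustration:

```python
# Model of consumer-group semantics: entries in one stream are partitioned
# among the consumers of a group, so each entry is processed exactly once
# and the group as a whole keeps up with the arrival rate.
import itertools

class ConsumerGroup:
    def __init__(self, consumers):
        self._cycle = itertools.cycle(consumers)

    def deliver(self, entry):
        # Each entry goes to exactly one consumer (round-robin here).
        return (entry, next(self._cycle))

group = ConsumerGroup(["img-proc-1", "img-proc-2", "img-proc-3"])
deliveries = [group.deliver(f"image-{i}") for i in range(9)]

per_consumer = {}
for _, consumer in deliveries:
    per_consumer[consumer] = per_consumer.get(consumer, 0) + 1
```

Nine entries spread evenly across three image processors, which is how a group absorbs a 500/sec arrival rate without a backlog.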
Classifier 1
Classifier 2
Classifier n
Consumer Group
XREADGROUP
XREAD
Consumers
Producer 2
Producer m
Producer 1
Producer 3
XADD
XACK
Deep Learning-based
Classification
Analytics
Data Backup
Messaging
Redis Streams Benefits
Simplifies data collection, processing, and distribution to support complex scenarios
Our Ingest Solution
Redis Stream
1. Data Ingest
Command
xadd clickstream * img [image_id]
Sample data
127.0.0.1:6379> xrange clickstream - +
1) 1) "1553536458910-0"
   2) 1) "image_1"
      2) "1"
2) 1) "1553536469080-0"
   2) 1) "image_3"
      2) "1"
3) 1) "1553536489620-0"
   2) 1) "image_3"
      2) "1"
...
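Entry IDs like 1553536458910-0 encode a millisecond timestamp plus a sequence number, which is what makes time-range (lookback) queries like XRANGE possible. A small illustrative parser for the sample entries above (Python for a self-contained illustration; `parse_entry` and `since` are hypothetical helpers, not Redis APIs):

```python
# A Redis stream entry ID is "<millisecond-timestamp>-<sequence>".
def parse_entry(entry_id, img):
    ts, seq = entry_id.split("-")
    return {"ts_ms": int(ts), "seq": int(seq), "img": img}

entries = [
    parse_entry("1553536458910-0", "image_1"),
    parse_entry("1553536469080-0", "image_3"),
    parse_entry("1553536489620-0", "image_3"),
]

def since(ts_ms):
    # Lookback query: clicks at or after a given millisecond timestamp
    return [e for e in entries if e["ts_ms"] >= ts_ms]
```

A lookback from the second entry's timestamp onward returns the last two clicks.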
2. Data Processing
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash → Spark SQL
1. Data Ingest · 2. Data Processing · 3. Data Querying
Data Processing using Spark’s Structured Streaming
What is Structured Streaming?
"Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming."
Definition
How Structured Streaming Works
Micro-batches as
DataFrames (tables)
Source: Data Stream
DataFrame Operations
Selection: df.select("xyz").where("a > 10")
Filtering: df.filter(_.a > 10).map(_.b)
Aggregation: df.groupBy("xyz").count()
Windowing: df.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
Deduplication: df.dropDuplicates("guid")
Output Sink
Spark Structured Streaming
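The aggregation used in this deck, groupBy("img").count, reduces each micro-batch to per-image counts. A plain-Python simulation of that per-batch logic (illustrative only; Spark does the same thing on a DataFrame):

```python
# One micro-batch of click events, reduced to per-image counts the way
# groupBy("img").count does on a DataFrame.
from collections import Counter

micro_batch = ["image_1", "image_3", "image_3", "image_7", "image_1"]
counts = Counter(micro_batch)
```

Each micro-batch produces a small table of (img, count) rows, which is what gets handed to the output sink.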
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash
Redis Streams as data source
Spark-Redis Library
Redis as data sink
• Developed in Scala
• Compatible with Spark 2.3 and higher
• Supports:
  • RDDs
  • DataFrames
  • Structured Streaming
Steps for Using Redis Streams as Data Source
1. Connect to the Redis instance
2. Map Redis Stream to Structured Streaming schema
3. Create the query object
4. Run the query
Redis Streams as Data Source
1. Connect to the Redis instance
val spark = SparkSession.builder()
  .appName("redis-df")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count
Redis Streams as Data Source
2. Map Redis Stream to Structured Streaming schema
val spark = SparkSession.builder()
  .appName("redis-df")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count
xadd clickstream * img [image_id]
Redis Streams as Data Source
3. Create the query object
val spark = SparkSession.builder()
  .appName("redis-df")
  .master("local[*]")
  .config("spark.redis.host", "localhost")
  .config("spark.redis.port", "6379")
  .getOrCreate()

val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count
Redis Streams as Data Source
4. Run the query
val clickstream = spark.readStream
  .format("redis")
  .option("stream.keys", "clickstream")
  .schema(StructType(Array(
    StructField("img", StringType)
  )))
  .load()

val queryByImg = clickstream.groupBy("img").count

val clickWriter: ClickForeachWriter = new ClickForeachWriter("localhost", "6379")

val query = queryByImg.writeStream
  .outputMode("update")
  .foreach(clickWriter)
  .start()

query.awaitTermination()
Custom output sink
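In "update" output mode, Spark hands the writer only the rows whose aggregates changed in the latest micro-batch, so the sink is upserted incrementally rather than rewritten. A simplified simulation with plain dicts (`updated_rows` is a hypothetical helper standing in for Spark's behavior, not a Spark API; Python for a self-contained illustration):

```python
# "update" mode emits only rows whose aggregate changed in the latest batch;
# the sink (standing in for the Redis hashes) is then upserted with just those.
def updated_rows(prev, new):
    return {k: v for k, v in new.items() if prev.get(k) != v}

sink = {}                                    # stands in for the Redis hashes
batch1 = {"image_1": 3, "image_2": 1}
batch2 = {"image_1": 5, "image_2": 1}        # only image_1's count changed

sink.update(updated_rows({}, batch1))        # first batch: everything is new
second_write = updated_rows(batch1, batch2)  # second batch: one changed row
sink.update(second_write)
```

Only image_1 is rewritten on the second batch, which keeps the per-batch write volume proportional to what actually changed.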
How to Setup Redis as Output Sink
override def process(record: Row) = {
  val img = record.getString(0)
  val count = record.getLong(1)
  if (jedis == null) {
    connect()
  }
  jedis.hset("clicks:" + img, "img", img)
  jedis.hset("clicks:" + img, "count", count.toString)
}
Create a custom class extending ForeachWriter and override its process() method.
Save as a Hash with the structure:

clicks:[image]
  img   [image]
  count [count]

Example:

clicks:image_1001
  img   image_1001
  count 1029
clicks:image_1002
  img   image_1002
  count 392
...

Table: Clicks
img        | count
image_1001 | 1029
image_1002 | 392
...        | ...
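The hash layout the writer produces can be modeled in a few lines. `to_hash` is a hypothetical helper mirroring the jedis.hset calls above, and the dict stands in for Redis; Python for a self-contained illustration:

```python
# Each aggregated row becomes a hash named clicks:<img> with fields img and
# count, matching the jedis.hset("clicks:"+img, ...) calls in the writer.
def to_hash(img, count):
    return f"clicks:{img}", {"img": img, "count": str(count)}

redis = {}   # key -> hash (dict of field -> value), stand-in for Redis
for img, count in [("image_1001", 1029), ("image_1002", 392)]:
    key, fields = to_hash(img, count)
    redis[key] = fields
```

The clicks: key prefix is what later lets Spark SQL treat all these hashes as rows of one table.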
3. Data Querying
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash → Spark SQL
1. Data Ingest · 2. Data Processing · 3. Data Querying
Query Redis using Spark SQL
Steps to Query Redis using Spark SQL
1. Initialize Spark Context with Redis
2. Create table
3. Run Query
Redis Hash to SQL mapping:

clicks:image_1001
  img   image_1001
  count 1029
clicks:image_1002
  img   image_1002
  count 392
...

img        | count
image_1001 | 1029
image_1002 | 392
...        | ...
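Conceptually, the mapping scans hashes by key prefix and projects their fields into columns. A rough Python illustration of that idea (not how the spark-redis library is implemented internally; the dict and names here are hypothetical):

```python
# Every hash whose key starts with "clicks:" becomes one row of the clicks
# table; its fields (img, count) become the columns.
hashes = {
    "clicks:image_1001": {"img": "image_1001", "count": "1029"},
    "clicks:image_1002": {"img": "image_1002", "count": "392"},
    "other:key": {"x": "1"},                 # ignored: wrong key prefix
}

table = sorted(
    (h["img"], int(h["count"]))
    for key, h in hashes.items()
    if key.startswith("clicks:")
)
```

Only the clicks:-prefixed hashes survive the scan, giving exactly the two-row table shown above.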
1. Initialize

scala> import org.apache.spark.sql.SparkSession
scala> val spark = SparkSession.builder()
         .appName("redis-test")
         .master("local[*]")
         .config("spark.redis.host", "localhost")
         .config("spark.redis.port", "6379")
         .getOrCreate()
scala> val sc = spark.sparkContext
scala> import spark.sql
scala> import spark.implicits._

2. Create table

scala> sql("CREATE TABLE IF NOT EXISTS clicks (img STRING, count INT) USING org.apache.spark.sql.redis OPTIONS (table 'clicks')")
How to Query Redis using Spark SQL
3. Run Query
scala> sql("select * from clicks").show();
+----------+-----+
|       img|count|
+----------+-----+
|image_1001| 1029|
|image_1002|  392|
|       ...|  ...|
+----------+-----+
How to Query Redis using Spark SQL
Code
Email dave@redislabs.com for this slide deck
Or download from https://github.com/redislabsdemo/
Recap
ClickAnalyzer
Redis Stream → Structured Stream Processing → Redis Hash → Spark SQL
1. Data Ingest · 2. Data Processing · 3. Data Querying
Building Blocks of our Solution
Spark-Redis Library is used for: Redis Streams as data source; Redis as data sink
Questions?
Thank you!
dave@redislabs.com
@davenielsen
Dave Nielsen


Editor's Notes

  • #3 Agenda. Takeaway: use Redis Streams + Spark-Redis + Structured Streaming with micro-batches in Spark to collect and process data streams in real time at scale
  • #4 Call the Spark engine with spark-submit in the scala directory: spark-submit --class com.redislabs.streaming.ClickAnalysis --jars ./lib/spark-redis-2.4.0-SNAPSHOT-jar-with-dependencies.jar --master local[*] ./target/scala-2.11/redisexample_2.11-1.0.jar. Four data structures: clickstream is the Redis Stream data structure that collects all clicks ($ XLEN clickstream). Spark 2.3 introduced Structured Streaming, which does micro-batches: it collects from streams such as Redis Streams or Kafka, a few messages every few milliseconds, then Spark runs queries to aggregate the results and stores them somewhere, such as Redis. Trick: any hash data structure whose key starts with clicks: belongs to a table called Clicks ($ HGETALL clicks:image_1); the fields img and count are columns in the table. Configure Spark SQL so it knows that any key starting with clicks: belongs to the Clicks table ($ HMSET clicks:image_test img test count 10000)
  • #5 Summarize demo
  • #6 Read then go to next slide
  • #7 What are the functional blocks? Data ingest: collect all clicks without losing any – Redis Cloud with Streams (free up to 30 MB), read via Spark-Redis into Spark. Data processing: process data in real time – Spark with Structured Streaming micro-batches. Data querying: some kind of custom chart, leaderboard, or Grafana, using Spark SQL with Redis Cloud again
  • #11 Cover at high level Connects producers with consumers. May have many of either
  • #12 Redis Streams supports both asynchronous communication and look-back queries
  • #13 How many have used Pub/Sub, Lists, or Sorted Sets? Pub/Sub: no lookback queries, all asynchronous. Lists: one list cannot support many consumers. Sorted Sets: solve that problem – you don't need a copy for each consumer, but you always have to poll for data. You can use a blocking call, but that transforms it into a list. For streaming you have to poll.
  • #14 Redis Streams manages the lifecycle of streaming data effectively (example: consumer groups and their commands XREADGROUP, XACK, and XCLAIM ensure every data object is consumed properly). It offers consumers a rich choice of options to consume the data: they can read from where they left off, only the new data, or from the beginning. The lookback queries are super fast as they are powered by radix trees. Kafka and Kinesis have a timeframe limit; Redis has no timeframe – you can cap by max length/size.
  • #15 If you have different types of consumers …
  • #16 Ex: toll booth – can back up – but we can match rate of arrival with rate of departure
  • #18 So this is our stream example
  • #19 clickstream is the stream key; the * lets Redis create the timestamp; img is the field. So that's how data ingest works. Any questions?
  • #20 Stop and run query to see latest count scala> sql("select * from clicks").show(); $ DEL clicks:image_test
  • #23 Marketing definition from databricks
  • #24 With Structured Streaming, Spark pulls data in micro-batches (like a table). Every micro-batch has rows, and each row is like a dataframe, so you can run dataframe operations on these micro-batches. Windowing can aggregate over, say, the last 30 minutes; you can also dedupe. Go to the Databricks website to see more. I'm doing aggregations in my demo. Spark used to only do batches; as of 2.4 there are micro-batches (batches in milliseconds), and continuous processing is available in experimental mode. Finally, you can define an output sink, or output to the console – three modes: dump everything, append only, or update.
  • #25 Redis Labs developed and supports this open-source library. Data source and data sink: read Redis Streams, dump data into Redis, and query Redis from Spark. All written in Scala.
  • #26 How do you use Redis as a data source? There are 4 steps: connect to the Redis db, map Redis Streams key-value pairs to the micro-batch, create the query object, and run the query in a loop.
  • #27 Connect to Redis Cloud. Move to next slide.
  • #28 2. Interpreting the stream data Clickstream is key name Img is field name
  • #29 3. Defining the query Group by img Count
  • #30 4. Dumping into a ForEachWriter - Custom – see next slide -
  • #32 Stop and run query to see latest count scala> sql("select * from clicks").show();
  • #33 Any questions? How do you query Redis? You can connect ODBC drivers to Spark and query Redis.
  • #34 How to connect to Redis How to map to Redis? Run Query
  • #35 2. Create a table (do it only once). Tell it to use the class org.apache.spark.sql.redis and map it to the table 'clicks'.
  • #36 3. Run query