Redis TLV Meetup
v5 & Streams
@itamarhaber, October 2018
Redis v5
Redis v5
• Released on Wed Oct 17 13:28:26 CEST 2018
• 9.5 years after v1, 15 months of development
• Major feature: Streams
• Other stuff:
– Project Spartacus
– Active Defragmentation v2
– LOLWUT
– ZPOPMIN/MAX and their blocking variants
– Integrated help for subcommands
Georg Nees' Schotter vs LOLWUT (performance)
© Victoria and Albert Museum, London
ZPOP - youtube.com/watch?v=Xk4avdjdM-E
Sorted Sets now support List-like pop operations.
● ZPOPMIN - removes and returns the
lowest-ranking member
● ZPOPMAX - same, but for
highest-ranking
● BZPOPMIN, BZPOPMAX -
blocking variants
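The pop semantics can be sketched in plain Python (a toy model of the behavior, not the actual Redis implementation): the sorted set is a member-to-score mapping, ordered by score with ties broken by member name.

```python
# Toy sketch of ZPOPMIN/ZPOPMAX semantics: order by score,
# break ties lexicographically by member name.
def zpopmin(zset: dict) -> tuple:
    """Remove and return the (member, score) pair with the lowest rank."""
    member = min(zset, key=lambda m: (zset[m], m))
    return member, zset.pop(member)

def zpopmax(zset: dict) -> tuple:
    """Same, but for the highest rank."""
    member = max(zset, key=lambda m: (zset[m], m))
    return member, zset.pop(member)

scores = {"a": 3.0, "b": 1.0, "c": 2.0}
print(zpopmin(scores))  # ('b', 1.0)
print(zpopmax(scores))  # ('a', 3.0)
```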
Integrated help for subcommands, e.g.:
Streams
A data stream is a sequence of elements. Consider:
• Real time sensor readings, e.g. particle colliders
• IoT, e.g. the irrigation of avocado groves
• User activity in an application
• …
• Messages in distributed systems
In the context of data processing...
… one in which the failure of a computer you
didn't even know existed can render your own
computer unusable.
Leslie Lamport
A distributed system is
“
... a model in which components located on
networked computers communicate and
coordinate their actions by passing messages
Distributed Computing, Wikipedia
Includes: client-server, 3/n-tier, peer to peer, SOA,
micro- & nanoservices, FaaS & serverless…
A distributed system is
“
There are only two hard problems in
distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery
Mathias Verraes, on Twitter
An observation
“
Fact #1: you can choose one and only one:
• At-most-once delivery, i.e. "fire and forget"
• At-least-once delivery, i.e. explicit
acknowledgement sent by the receiver
Fact #2: exactly-once delivery doesn't exist (unless
you change the definition of MDS)
Observation: order is usually important (duh)
Refresher on message delivery semantics
Consider the non-exhaustive list at taskqueues.com
• 17 message brokers, including: Apache Kafka,
NATS, RabbitMQ and Redis
• 17 queue solutions, including: Celery, Kue,
Laravel, Sidekiq, Resque and RQ (<- all these use
Redis as their backend btw ;))
And that's without considering protocol-based
architectures, legacy buses, etc...
This isn't exactly a new challenge
Streams: Anatomy
The Log is a storage abstraction that is:
• Append-only, can be truncated
• A sequence of records ordered by time
A Logical Log is:
• Based on a logical offset, e.g. time (vs. bytes)
• (Therefore time range queries)
• Can be made up of data structures (vs. lines)
A Stream is not unlike a (logical) log
A Stream is (also) a storage abstraction, that is
basically an ordered, logical log of messages
(records). These messages are:
● Made up of:
○ Data payload (semi-structured usually)
○ Metadata (e.g. identifier)
● Immutable, once created
● Always added to the end of the stream
So what is a stream?
A producer is a software component that appends
messages to the end of a stream.
A consumer is a software component that reads
messages from a stream (and acts on them). It can
start reading the messages from any arbitrary
offset, or just wait for new ones.
The Stream Players: producers and consumers
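The two roles above can be sketched with a toy in-memory stream (purely illustrative, not Redis code): the producer only appends, and a consumer can read from any offset.

```python
# Toy in-memory stream illustrating producer/consumer roles.
class Stream:
    def __init__(self):
        self.messages = []              # append-only log

    def add(self, payload):             # producer side: always appends
        self.messages.append(payload)
        return len(self.messages) - 1   # the new message's offset

    def read(self, offset):             # consumer side: read from any offset
        return self.messages[offset:]

s = Stream()
s.add({"temp": 21.5})
s.add({"temp": 21.7})
print(s.read(0))   # both messages
print(s.read(1))   # only the second
```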
A picture I made of a stream
(Diagram: messages 0-13 in a stream; the producer appends
at the end, while Consumer 1 and Consumer 2 each hold their
own read position, with the "next" message being the one
just past that position. Message 0 is the oldest.)
1. A component can both be a producer and a
consumer. Either for the same, or between
different streams. It depends.
2. Multiple stream producers can exist. At least one
is usually needed though.
3. A stream can exist without any consumers.
That's kind of pointless though.
Some observations
Streams: Motivation
Consider the alternative, i.e. batch processing.
Besides fast response times (batch=1):
● Scalable (distributed) design
● Loose coupling of components and faults
● Enables building more complex pipelines
Why (do architects) build stuff on streams?
… that architects like using, like:
● CQRS
● Event sourcing
● Unified (distributed commit) log
● Microservices
● ...
Also, they fit so well with other stuff...
An abstraction that is useful to work with (in
distributed systems).
Capable of trivially addressing message ordering.
Able to provide (depending on the implementation)
AMO and ALO MDS
An enabler for Stream Processing (e.g. Spark
Streams, Kafka's Stream Processors).
The Stream (for architects) is
Streams: Redis'
Necessity is the mother of invention
There ain't no such thing as a free lunch
The existing (i.e. lists, sorted sets, PubSub) isn't
"good enough" for things like:
• Log/time series-like data patterns
• At-least-once messaging with fan-out
Also Disque, listpacks, radix trees & reading Kafka :)
Why reinvent hot water (in Redis)?
“
“
● Sorted Sets? Memory hungry, no `BZPOPMIN` (at
that time ;)), ordering depends on a mutable score,
and elements must be unique
● Lists? Inefficient access (linear), index
(changeable)-based, and only have queue-like
blocking operations (single consumer)
● PubSub? Fan-out, sure, but only AMO MDS
How do you model "messages" in Redis < v5?
● A project by Salvatore Sanfilippo
● Like Redis, but is "a distributed, in-memory,
message broker"
● Eventually consistent (AP in CAP terms)
● Last updated: Jan 2016
● Planned to come back as a Redis module in v6
● Observation: A Stream "API" can also be built on
top of a message broker (see Kafka)
Interjection: What is Disque?
● A 1st-class citizen, a data structure like any other
● The most complex, implementation-wise
● Stores entries
● Is conceptually made up of 3 APIs:
a. Producer
b. Consumer
c. Consumers Group
What is the Redis Stream?
XADD key [MAXLEN [~] n] <ID | *>
<field> <value> [field value…]
> XADD mystream * foo bar baz qaz
1532556560197-42
Time complexity: O(log n)
See https://redis.io/commands/xadd
The Redis Stream Producer API
Every entry has a unique ID that is its logical offset.
The ID has the following format:
<epoch-milliseconds>-<sequence>
Each ID part is a 64-bit unsigned integer
Sequence is for ordering at millisecond scope
When user-provided, it has to be greater than the latest ID.
When auto-generated (`*`), max(localtime, latest) is used.
The entry ID
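The ordering and auto-generation rules above can be sketched as (a simplification: real IDs are 64-bit unsigned parts, and user-provided IDs are validated separately):

```python
# Sketch of how <ms>-<seq> entry IDs compare and advance.
def parse_id(entry_id: str) -> tuple:
    """Split an ID into its (milliseconds, sequence) parts."""
    ms, seq = entry_id.split("-")
    return int(ms), int(seq)

def next_id(latest: str, now_ms: int) -> str:
    """Auto-generate the next ID as max(localtime, latest): when the
    clock hasn't advanced past the latest ID, bump the sequence."""
    lms, lseq = parse_id(latest)
    if now_ms > lms:
        return f"{now_ms}-0"
    return f"{lms}-{lseq + 1}"

# IDs compare as (ms, seq) tuples, i.e. by time then by sequence:
assert parse_id("1532556560197-42") < parse_id("1532556560198-0")
print(next_id("1532556560197-42", 1532556560197))  # 1532556560197-43
print(next_id("1532556560197-42", 1532556560300))  # 1532556560300-0
```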
Redis' Stream entries are made up of field-value
pairs. Like a Hash.
Unlike a Hash, repeating field names in consecutive
entries are compressed.
Values are not compressed. Yet.
(Time series engines often compress values, with
values being after all just numbers)
The entry itself
The `MAXLEN` option is for that.
The `~` means "approximately", and is less expensive to use.
The stream is capped by the number of entries.
Not by time frame.
Future regarding that is "yet unclear" - ref:
https://stackoverflow.com/questions/51168273/redis-stream-managing-a-time-frame
Side note: capped streams
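A rough model of the capping behavior (an illustration, with arbitrary numbers: the real `~` form trims lazily, whole internal radix-tree nodes at a time):

```python
from collections import deque

# Toy MAXLEN capping: keep at most `maxlen` entries, evicting from
# the head. The approximate (`~`) form only trims once the overshoot
# is "large enough" (modeled here with an arbitrary slack of 100).
def xadd_capped(stream: deque, entry, maxlen: int, approx: bool = False):
    stream.append(entry)
    threshold = maxlen + 100 if approx else maxlen
    if len(stream) > threshold:
        while len(stream) > maxlen:
            stream.popleft()

s = deque()
for i in range(5):
    xadd_capped(s, i, maxlen=3)
print(list(s))  # [2, 3, 4] -- exact trimming keeps the newest 3
```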
XLEN key - does exactly that, not very interesting.
X[REV]RANGE key <start | -> <end | +>
[COUNT count] - much more interesting :)
Get a single entry (start = end = ID)
SCAN-like iteration on a stream (by incrementing IDs), but better
Range (timeframe) queries on a stream
What's in the stream "API"
> XRANGE mystream - +
1) 1) 1532556560197-0
   2) 1) "foo"
      2) "bar"
      3) "baz"
      4) "qaz"
A "real" picture of a stream
Yes. No. Maybe.
It can be used for consuming, but that requires the
client constantly polling the stream for new entries.
So generally, no. There's something better for
consumers.
Is X[REV]RANGE the Consumer API?
XREAD [COUNT count]
STREAMS key [key ...] ID [ID ...]
Somewhat like X[REV]RANGE, but:
● Supports multiple streams
● Easier to consume from an offset onwards
(compared to fetching ranges)
● But it is still polling, so...
The Redis Stream Consumer API
XREAD [COUNT count] [BLOCK ms]
STREAMS key [key ...] ID [ID ...]
● Like `BRPOP` (or `BZPOPMIN` ;))
● Supports the special `$` ID, i.e. "new messages
since blockage"
● What about message delivery semantics?
The Redis Stream Consumer Blocking API
Like PubSub, it appears to "fire and forget", or
at-most-once delivery for efficient fan-out.
Contrastingly, messages in a stream are stored.
The consumer manages its last read ID, and can
resume from any point.
(And unlike blocking list (and zset :)) operations,
multiple consumers can consume the same stream)
XREAD [BLOCK] message delivery semantics …
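The "consumer manages its own last-read ID" pattern can be sketched like so (a toy in-memory stand-in for the stream; IDs are (ms, seq) tuples):

```python
# Toy XREAD-style consumer: track the last-read ID, resume from it.
log = [((1, 0), "a"), ((2, 0), "b"), ((3, 0), "c")]

def xread(last_id: tuple) -> list:
    """Return all entries with an ID strictly greater than last_id."""
    return [(i, v) for i, v in log if i > last_id]

last = (0, 0)
batch = xread(last)        # first read: everything so far
last = batch[-1][0]        # the consumer remembers its position
log.append(((4, 0), "d"))  # a producer adds a new entry
print(xread(last))         # resuming returns only ((4, 0), 'd')
```

Because entries are stored (unlike PubSub), a restarted consumer can simply persist `last` and pick up where it left off.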
A consumer of a stream gets all entries in order,
and will eventually become a bottleneck. Or fail.
Possible workarounds:
• Add a "type" field to each record - that's dumb
• Shard the stream to multiple keys - meh
• Have the consumer dispatch entries as jobs in
queues or messages in a … GOTO 10
The problem with scaling consumers
Consider the Stream.
There needs to be a way to construct a
high-level/pseudo consumer, one made up of
multiple instances running in parallel, each
processing a mutually-exclusive subset of the
entries.
Another, high-level, perspective
… allow multiple consumers to cooperate in
processing messages arriving in a stream, so that
each consumer in a given group takes a subset
of the messages.
Shifts the complexity of recovering from consumer
failures and group management to the Redis server
Consumer Groups
“
A group picture (via @antirez)
1. Members are identified
2. New members get only undelivered messages
3. Each message is delivered to only one member
4. A member can only read its messages
5. A member must explicitly acknowledge the
receipt of messages
Observation: Big Brother (Redis) is observing you
Trivia: this is where most of the effort went
(Consumer) Group membership rules
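Rules 2, 3 and 5 can be modeled with a few lines of Python (a toy model, not how Redis tracks state internally): each message is delivered to exactly one member, and stays pending until acknowledged.

```python
# Toy consumer group: one delivery per message, tracked until ACKed.
class Group:
    def __init__(self):
        self.next = 0   # offset of the first never-delivered message
        self.pel = {}   # msg_id -> member it was delivered to

    def readgroup(self, member: str, stream: list):
        """Deliver one new (never-before-delivered) message to `member`."""
        if self.next >= len(stream):
            return None
        msg_id = self.next
        self.next += 1
        self.pel[msg_id] = member   # pending until acknowledged
        return msg_id, stream[msg_id]

    def ack(self, msg_id: int):
        self.pel.pop(msg_id, None)  # acknowledged: drop from the PEL

stream = ["m0", "m1", "m2"]
g = Group()
print(g.readgroup("alice", stream))  # (0, 'm0')
print(g.readgroup("bob", stream))    # (1, 'm1') -- a different message
g.ack(0)
print(g.pel)                         # {1: 'bob'} -- m1 still pending
```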
XREADGROUP GROUP
<groupname> <consumername>
[COUNT count] [BLOCK ms]
STREAMS key [key ...] ID [ID ...]
● consumername is the member's ID
● groupname is the name of the group
● The special `>` ID means "new messages", any
other ID returns the consumer's history
Consumers Group API, #1
XGROUP CREATE <key> <groupname>
<id or $> // Explicit creation!
// And key must exist
XGROUP SETID <key> <id or $>
XGROUP DESTROY <key> <groupname>
XGROUP DELCONSUMER <key>
<groupname> <consumername>
Consumers Group API, #2
One of the internal data structures used.
Tracks which member saw which messages.
● When a new message is delivered, a new entry in
the list is created
● When an "old" message is delivered, the last
delivered timestamp and number of deliveries
counter (for it) are updated
The Pending Entries List (PEL) is
XACK <key> <group> <id> [<id> …]
Acknowledges the receipt of messages.
(that's at-least-once message delivery semantics)
Essentially removes them from the PEL.
Observation: consumername is not required, only
an ID, so anyone can `XACK` pending messages.
Consumers Group API, #3
XPENDING <key> <group>
[<start> <stop> <count>]
[<consumer>]
XCLAIM <key> <group> <consumer>
<min-idle-time>
<id> [<id> …] [MOAR]
CG introspection & handling consumer failures.
Consumers Group API, #4
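The failure-recovery idea behind `XCLAIM` can be sketched as follows (an illustration only; the real PEL also tracks a delivery counter, and times are in milliseconds):

```python
# Toy XCLAIM: reassign pending messages idle longer than min_idle.
def xclaim(pel: dict, new_owner: str, min_idle: int, now: int) -> list:
    """pel maps msg_id -> (owner, last_delivery_time).
    Transfer stale entries to new_owner, refreshing their timestamp."""
    claimed = []
    for msg_id, (owner, delivered_at) in pel.items():
        if now - delivered_at >= min_idle:
            pel[msg_id] = (new_owner, now)
            claimed.append(msg_id)
    return claimed

pel = {1: ("alice", 100), 2: ("alice", 950)}
print(xclaim(pel, "bob", min_idle=500, now=1000))  # [1] -- only the stale one
print(pel[1])                                      # ('bob', 1000)
```

A surviving consumer would first inspect `XPENDING` for idle entries, then claim and reprocess them, exactly the loop this sketch compresses.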
XINFO <key>
XDEL <key> <id> [<id> …]
XTRIM <key> [MAXLEN [~] <n>]
Streams API, some loose ends
Definitive answer!
Mebbe.
K10xby!!one
Questions?
• Introduction to Redis Streams https://redis.io/topics/streams-intro
• The Redis Manifesto https://github.com/antirez/redis/blob/unstable/MANIFESTO
• Salvatore's blog posts http://antirez.com/news/114 and http://antirez.com/news/116
• Salvatore's inaugural Streams demo https://www.youtube.com/watch?v=ELDzy9lCFHQ
• Salvatore's live demo at Redis Day Tel Aviv 2018
https://www.youtube.com/watch?v=qXEyuUxQXZM
• RCP 11 - The stream data type https://github.com/redis/redis-rcp/blob/master/RCP11.md
• Reddit discussion
https://www.reddit.com/r/redis/comments/4mmrgr/stream_data_structure_for_redis_lets
_design_it/
• Hacker News discussion https://news.ycombinator.com/item?id=15384396
• Consumer groups specification
https://gist.github.com/antirez/68e67f3251d10f026861be2d0fe0d2f4
• Consumer groups API https://gist.github.com/antirez/4e7049ce4fce4aa61bf0cfbc3672e64d
(some) Redis References
