This document provides an overview of AWS Kinesis and its components for streaming data. It describes Amazon Kinesis Streams for processing real-time streaming data at large scale. Key concepts explained include shards, data records, partition keys, sequence numbers, and resharding streams. It also covers the Amazon Kinesis Producer Library, Amazon Kinesis Client Library, and how to handle failures and duplicate records. Amazon Kinesis Firehose and Kinesis Analytics are introduced for loading and analyzing streaming data. Comparisons are made between Kinesis and other AWS services like DynamoDB Streams, SQS, and Kafka.
3. Table of Contents
Streaming data?
Big Data Processing Approaches
AWS Kinesis Family
Amazon Kinesis Streams in detail
Amazon Kinesis Firehose
Amazon Kinesis Analytics
4. Streaming Data: Life As It Happens
After the event occurs -> at rest (batch)
As the event occurs -> in motion (streaming)
5. Big Data Processing Approaches
• Common Big Data Processing Approaches
• Query Engine Approach (Data Warehouse, SQL, NoSQL Databases)
• Repeated queries over the same well-structured data
• Pre-computations like indices and dimensional views improve query performance
• Batch Engines (Map-Reduce)
• The “query” is run on the data. There are no pre-computations
• Streaming Big Data Processing Approach
• Real-time response to content in semi-structured data streams
• Relatively simple computations on data (aggregates, filters, sliding window, etc.)
• Enables data lifecycle by moving data to different stores / open source systems
7. Amazon Kinesis Streams
• A fully managed service for real-time processing of high-volume streaming data.
• Kinesis can store and process terabytes of data an hour from hundreds of thousands of sources.
• Data is replicated across multiple Availability Zones to ensure high durability and availability.
9. Shard
• Streams are made of shards. A shard is the base throughput unit of an Amazon Kinesis stream.
• One shard provides a capacity of 1 MB/sec data input and 2 MB/sec data output.
• One shard can support up to 1,000 PUT records per second.
• You can monitor shard-level metrics in Amazon Kinesis Streams.
• Add or remove shards from your stream dynamically as your data throughput changes by resharding the stream.
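These per-shard limits translate directly into a sizing rule: provision enough shards to cover whichever dimension (ingress, egress, or record rate) is the bottleneck. A minimal sketch; the function name is ours, not part of any AWS SDK:

```python
import math

def shards_needed(mb_in_per_sec, mb_out_per_sec, records_per_sec):
    """Minimum shard count given the per-shard limits above:
    1 MB/s input, 2 MB/s output, 1,000 PUT records/s."""
    return max(
        math.ceil(mb_in_per_sec / 1.0),       # ingress-bound
        math.ceil(mb_out_per_sec / 2.0),      # egress-bound
        math.ceil(records_per_sec / 1000.0),  # record-count-bound
        1,                                    # a stream has at least one shard
    )
```

For example, a stream written at 5 MB/s, read at 4 MB/s, with 3,500 records/s needs max(5, 2, 4) = 5 shards.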
10. Data Record
• A record is the unit of data stored in an Amazon Kinesis stream.
• A record is composed of:
• partition key
• sequence number
• data blob (the data you want to send)
• The maximum size of a data blob (the data payload after Base64-decoding) is 1 megabyte (MB).
11. Partition Key
• The partition key is used to segregate and route data records to different shards of a stream.
• A partition key is specified by your data producer while putting data into an Amazon Kinesis stream.
• For example, assume you have an Amazon Kinesis stream with two shards (Shard 1 and Shard 2). You can configure your data producer to use two partition keys (Key A and Key B) so that all data records with Key A are added to Shard 1 and all data records with Key B are added to Shard 2.
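Under the hood, Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that integer. A local illustration of the mapping; the routing itself is done by the service, and `shard_for` is our own helper:

```python
import hashlib

def hash_key_for(partition_key):
    # Kinesis hashes the partition key with MD5 into a 128-bit integer.
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")

def shard_for(partition_key, shard_ranges):
    """Index of the shard whose [start, end] hash key range (the values
    returned by shard.getHashKeyRange()) contains the key's hash."""
    h = hash_key_for(partition_key)
    for i, (start, end) in enumerate(shard_ranges):
        if start <= h <= end:
            return i
    raise ValueError("hash key ranges do not cover the key space")

# Two shards splitting the 128-bit key space in half:
MID = 2 ** 127
two_shards = [(0, MID - 1), (MID, 2 ** 128 - 1)]
```

In practice you rarely pin keys to shards as literally as in the Key A / Key B example; with many distinct keys, MD5 spreads records roughly evenly across the ranges.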
12. Sequence Number
• Each data record has a sequence number that is unique within its shard.
• The sequence number is assigned by Streams after you write to the stream with client.putRecords or client.putRecord.
• Sequence numbers for the same partition key generally increase over time; the longer the time period between write requests, the larger the sequence numbers become.
13. Resharding the Stream
• Streams supports resharding, which enables you to adjust the number of shards in your stream to adapt to changes in the rate of data flow through the stream.
• There are two types of resharding operations: shard split and shard merge.
• Shard split: divide a single shard into two shards.
• Shard merge: combine two shards into a single shard.
14. Resharding the Stream
• Resharding is always "pairwise": you cannot split into, or merge, more than two shards in a single operation.
• Resharding is typically performed by an administrative application that is distinct from the producer (put) applications and the consumer (get) applications.
• The administrative application would also need a broader set of IAM permissions for resharding.
15. Splitting a Shard
• Specify how hash key values from the parent shard should be redistributed to the child shards.
• The possible hash key values for a given shard constitute a set of ordered, contiguous, non-negative integers. This range of possible hash key values is given by:
shard.getHashKeyRange().getStartingHashKey();
shard.getHashKeyRange().getEndingHashKey();
• When you split the shard, you specify a value in this range.
• That hash key value and all higher hash key values are distributed to one of the child shards.
• All the lower hash key values are distributed to the other child shard.
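For an even split, the natural choice is the midpoint of the parent's range, which becomes the starting hash key of the upper child (the value SplitShard takes as NewStartingHashKey). A sketch of the arithmetic, with ranges as plain integer pairs:

```python
def split_point(starting_hash_key, ending_hash_key):
    """Midpoint of the parent's hash key range: this value and all
    higher keys go to one child shard, all lower keys to the other."""
    return (starting_hash_key + ending_hash_key + 1) // 2

# Splitting the full 128-bit key space yields two equal, contiguous halves:
parent = (0, 2 ** 128 - 1)
mid = split_point(*parent)
child_low = (parent[0], mid - 1)
child_high = (mid, parent[1])
```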
16. Merging Two Shards
• In order to merge two shards, the shards must be adjacent.
• Two shards are considered adjacent if the union of their hash key ranges forms a contiguous set with no gaps.
• To identify shards that are candidates for merging, you should filter out all shards that are in a CLOSED state.
• Shards that are OPEN (that is, not CLOSED) have an ending sequence number of null.
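The adjacency rule reduces to a one-line check on the two hash key ranges (a sketch; the integer pairs stand in for each shard's getHashKeyRange() values):

```python
def are_adjacent(range_a, range_b):
    """Two shards are mergeable only if the union of their hash key
    ranges is contiguous with no gap, in either order."""
    (a_start, a_end), (b_start, b_end) = sorted([range_a, range_b])
    return a_end + 1 == b_start
```

For example, (0, 99) and (100, 199) are adjacent; (0, 99) and (150, 199) leave a gap and are not.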
17. After Resharding
• After you call a resharding operation (either splitShard or mergeShards), you need to wait for the stream to become ACTIVE again, just as you do after creating a stream.
• During resharding, a parent shard transitions from an OPEN state to a CLOSED state, and eventually to an EXPIRED state.
• When the operation completes, the stream returns to the ACTIVE state.
18. Retention Period
• Data records are accessible for a default of 24 hours from the time they are added to a stream.
• The retention period is configurable in hourly increments, from 24 to 168 hours (1 to 7 days).
19. Amazon Kinesis Producer Library (KPL)
• The KPL is an easy-to-use, highly configurable library that helps you write to an Amazon Kinesis stream.
• Writes to one or more Amazon Kinesis streams with an automatic and configurable
retry mechanism
• Collects records and uses PutRecords to write multiple records to multiple shards
per request
• Aggregates user records to increase payload size and improve throughput
• Integrates seamlessly with the Amazon Kinesis Client Library (KCL) to de-aggregate
batched records on the consumer
• Submits Amazon CloudWatch metrics on your behalf to provide visibility into
producer performance
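The collection behaviour above can be pictured as a small buffering layer in front of PutRecords. A simplified sketch (the real KPL is a separate native process with far more configuration; the 500-record and 5 MB limits here mirror the PutRecords API limits):

```python
class Batcher:
    """Buffer records and flush them as one PutRecords call once a
    count or size limit would be exceeded."""
    MAX_RECORDS = 500             # PutRecords: max records per request
    MAX_BYTES = 5 * 1024 * 1024   # PutRecords: max payload per request

    def __init__(self, send):
        self.send = send          # callable standing in for client.put_records
        self.buffer, self.bytes = [], 0

    def add(self, partition_key, data):
        size = len(partition_key.encode("utf-8")) + len(data)
        if (len(self.buffer) + 1 > self.MAX_RECORDS
                or self.bytes + size > self.MAX_BYTES):
            self.flush()
        self.buffer.append({"PartitionKey": partition_key, "Data": data})
        self.bytes += size

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer, self.bytes = [], 0
```

Calling add() 501 times produces one automatic flush of 500 records, with the remainder sent on the final explicit flush().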
20. Amazon Kinesis Client Library (Life Saver)
• Develop a consumer application for Amazon Kinesis Streams.
• The KCL acts as an intermediary between your record processing logic and Streams.
• A KCL application instantiates a worker with configuration information, and then uses a record processor to process the data received from an Amazon Kinesis stream.
• You can run a KCL application on any number of instances. Multiple instances of the same application coordinate on failures and load-balance dynamically.
• You can also have multiple KCL applications working on the same stream, subject to throughput limits.
21. Amazon Kinesis Client Library
• Connects to the stream
• Enumerates the shards
• Coordinates shard associations with other workers (if any)
• Instantiates a record processor for every shard it manages
• Pulls data records from the stream
• Pushes the records to the corresponding record processor
• Checkpoints processed records
• Balances shard-worker associations when the worker instance count changes
• Balances shard-worker associations when shards are split or merged
22. Amazon Kinesis Client Library
• The KCL uses a unique Amazon DynamoDB table to keep track of the application's state.
• The KCL creates the table with a provisioned throughput of 10 reads per second and 10 writes per second.
• Each row in the DynamoDB table represents a shard that is being processed by your application. The hash key for the table is the shard ID.
23. Amazon Kinesis Client Library
• In addition to the shard ID, each row also includes the following data:
• checkpoint: The most recent checkpoint sequence number for the shard. This value is unique across
all shards in the stream.
• checkpointSubSequenceNumber: When using the Kinesis Producer Library's aggregation feature,
this is an extension to checkpoint that tracks individual user records within the Amazon Kinesis record.
• leaseCounter: Used for lease versioning so that workers can detect that their lease has been taken by
another worker.
• leaseKey: A unique identifier for a lease. Each lease is particular to a shard in the stream and is held
by one worker at a time.
• leaseOwner: The worker that is holding this lease.
• ownerSwitchesSinceCheckpoint: How many times this lease has changed workers since the last
time a checkpoint was written.
• parentShardId: Used to ensure that the parent shard is fully processed before processing starts on
the child shards. This ensures that records are processed in the same order they were put into the
stream.
24. Using Shard Iterators
• You retrieve records from the stream on a per-shard basis.
• AT_SEQUENCE_NUMBER
• AFTER_SEQUENCE_NUMBER
• AT_TIMESTAMP
• TRIM_HORIZON
• LATEST
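Whichever iterator type you start with, the read loop is the same: get an iterator, then call GetRecords repeatedly, following NextShardIterator. A sketch against a stubbed client (a real application would pass a boto3 kinesis client; note one simplification: a real open shard never returns a null NextShardIterator, only a closed, fully-read shard does, so this stub ends the loop the way a closed shard would):

```python
class StubKinesis:
    """Mimics the response shapes of GetShardIterator / GetRecords."""
    def __init__(self, records):
        self._records = records

    def get_shard_iterator(self, StreamName, ShardId, ShardIteratorType):
        # TRIM_HORIZON: start from the oldest available record.
        return {"ShardIterator": "0"}

    def get_records(self, ShardIterator, Limit=100):
        pos = int(ShardIterator)
        batch = self._records[pos:pos + Limit]
        done = pos + len(batch) >= len(self._records)
        return {"Records": batch,
                "NextShardIterator": None if done else str(pos + len(batch))}

def drain_shard(client, stream, shard_id):
    """Read every record from one shard, oldest first."""
    out = []
    it = client.get_shard_iterator(StreamName=stream, ShardId=shard_id,
                                   ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    while it is not None:
        resp = client.get_records(ShardIterator=it, Limit=2)
        out.extend(resp["Records"])
        it = resp["NextShardIterator"]
    return out
```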
25. Recovering from Failures
• Record Processor Failure
• The worker invokes record processor methods using Java ExecutorService tasks.
• If a task fails, the worker retains control of the shard that the record processor was
processing.
• The worker starts a new record processor task to process that shard
• Worker or Application Failure
• If a worker, or an instance of the Amazon Kinesis Streams application, fails, you should detect and handle the situation, for example by restarting the worker so that a worker can take over its shard leases and resume from the last checkpoint.
26. Handling Duplicate Records (Idempotency)
• There are two primary reasons why records may be delivered more than one time to your Amazon Kinesis Streams application:
• producer retries
• consumer retries
• Your application must anticipate and appropriately handle processing individual records multiple times.
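One common pattern is consumer-side deduplication: skip any record whose unique ID has already been processed. A sketch (in a real application the seen-set would live in durable storage such as DynamoDB, keyed by a unique field the producer embeds in each record, rather than in memory):

```python
class IdempotentProcessor:
    """Process each logical record at most once, even when the stream
    delivers it multiple times."""
    def __init__(self):
        self.seen = set()
        self.results = []

    def process(self, record_id, payload):
        if record_id in self.seen:
            return False              # duplicate delivery: ignore
        self.seen.add(record_id)
        self.results.append(payload)  # the actual (side-effecting) work
        return True
```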
27. Pricing
• Shard Hour (1 MB/second ingress, 2 MB/second egress): $0.015
• PUT Payload Units, per 1,000,000 units: $0.014
• Extended Data Retention (up to 7 days), per Shard Hour: $0.020
• Plus DynamoDB charges for the state table if you use the KCL
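These line items combine into a rough monthly estimate. A sketch using the listed prices (a PUT Payload Unit is a 25 KB chunk of a record, so records up to 25 KB consume one unit each; the KCL's DynamoDB cost is not included):

```python
import math

HOURS_PER_MONTH = 730  # ~ 24 * 365 / 12

def monthly_cost(shards, put_records_per_sec, avg_record_kb=1.0,
                 extended_retention=False):
    """Rough monthly USD estimate from the listed prices."""
    shard_hours = shards * HOURS_PER_MONTH * 0.015
    units_per_record = math.ceil(avg_record_kb / 25.0)  # 25 KB per PUT Payload Unit
    puts = put_records_per_sec * 3600 * HOURS_PER_MONTH
    put_cost = puts * units_per_record / 1_000_000 * 0.014
    retention = shards * HOURS_PER_MONTH * 0.020 if extended_retention else 0.0
    return shard_hours + put_cost + retention
```

For example, two shards ingesting 100 small records/sec come to roughly $21.90 in shard hours plus about $3.68 in PUT Payload Units per month.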
28. Kafka vs. Kinesis Streams
• In Kafka you can configure, for each topic, the replication factor and how many replicas have to acknowledge a message before it is considered successful. So you can definitely make it highly available.
• Amazon ensures that you won't lose data, but that comes with a performance cost: messages are written to three different AZs synchronously.
• There are several benchmarks online comparing Kafka and Kinesis, but the result is always the same: you will have a hard time replicating Kafka's performance in Kinesis, at least for a reasonable price.
• This is partly because Kafka is insanely fast, but also because Kinesis writes each message synchronously to three different machines, which is quite costly in terms of latency and throughput.
• Kafka is one of the preferred options for the Apache stream processing frameworks
• Unsurprisingly, Kinesis is really well integrated with other AWS services
29. DynamoDB Streams vs. Kinesis Streams
• DynamoDB Streams actions are similar to their counterparts in Amazon Kinesis Streams, but they are not 100% identical.
• You can write applications for Amazon Kinesis Streams using the Amazon Kinesis Client Library (KCL).
• You can leverage the design patterns found within the KCL to process DynamoDB Streams shards and stream records. To do this, you use the DynamoDB Streams Kinesis Adapter.
30. SQS vs. Kinesis Streams
• Amazon Kinesis Streams enables real-time
processing of streaming big data.
• It provides ordering of records, as well as the
ability to read and/or replay records in the same
order to multiple Amazon Kinesis Applications.
• The Amazon Kinesis Client Library (KCL)
delivers all records for a given partition key to
the same record processor, making it easier to
build multiple applications reading from the same
Amazon Kinesis stream (for example, to perform
counting, aggregation, and filtering).
• Amazon Simple Queue Service (Amazon SQS)
offers a reliable, highly scalable hosted queue
for storing messages as they travel between
computers.
• Amazon SQS lets you easily move data between
distributed application components and helps
you build applications in which messages are
processed independently (with message-level
ack/fail semantics), such as automated
workflows.
32. Amazon Kinesis Firehose
• Amazon Kinesis Firehose is the easiest way to load streaming data into AWS.
• It can capture, transform, and load streaming data into Amazon Kinesis Analytics, Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.
• Fully managed service that automatically scales to match the throughput of
your data and requires no ongoing administration.
• It can also batch, compress, and encrypt the data before loading it,
minimizing the amount of storage used at the destination and increasing
security.
33. Amazon Kinesis Analytics
• Process streaming data in real time with standard SQL
• Query streaming data or build entire streaming applications using SQL, so
that you can gain actionable insights and respond to your business and
customer needs promptly.
• Scales automatically to match the volume and throughput rate of your
incoming data
• Only pay for the resources your queries consume. There is no minimum fee
or setup cost.