2. DATA COLLECTION INTRODUCTION
• Real Time - Immediate actions
  - Kinesis Data Streams (KDS)
  - Simple Queue Service (SQS)
  - Internet of Things (IoT)
• Near-real time - Reactive actions
  - Kinesis Data Firehose (KDF)
  - Database Migration Service (DMS)
• Batch - Historical Analysis
  - Snowball
  - Data Pipeline
4. AWS KINESIS OVERVIEW
• Kinesis is a managed alternative to Apache Kafka
• Great for application logs, metrics, IoT, clickstreams
• Great for "real-time" big data
• Great for stream processing frameworks (Spark, NiFi, etc.)
• Data is automatically replicated across 3 AZs
• Kinesis Components
  - Kinesis Streams: low-latency streaming ingest at scale
  - Kinesis Analytics: perform real-time analytics on streams using SQL
  - Kinesis Firehose: load streams into S3, Redshift, ElasticSearch, ...
6. AWS KINESIS OVERVIEW
• Streams are divided into ordered Shards / Partitions
• Data retention is 1 day by default, can go up to 7 days
• Ability to reprocess / replay data
• Multiple applications can consume the same stream
• Real-time processing with scalable throughput
• Once data is inserted into Kinesis, it can't be deleted (immutability)
7. AWS KINESIS STREAMS SHARDS
• One stream is made of many different shards
• Billing is per provisioned shard; you can have as many shards as you want
• Batching available, or per-message calls
• The number of shards can evolve over time (reshard / merge)
• Records are ordered per shard
8. AWS KINESIS STREAMS RECORDS
• Data Blob: the data being sent, serialized as bytes. Up to 1 MB. Can represent anything
• Record Key: sent alongside a record, helps to group records in shards. Same key = same shard. Use a highly distributed key to avoid the "hot partition" problem
• Sequence Number: unique identifier for each record put in a shard. Added by Kinesis after ingestion
9. DATA ORDERING FOR KINESIS
• Imagine you have 100 trucks (truck_1, truck_2, ... truck_100) on the road sending their GPS positions regularly into AWS.
• You want to consume the data in order for each truck, so that you can track their movement accurately.
• How should you send that data into Kinesis?
• Answer: send using a "Partition Key" value of the "truck_id"
• The same key will always go to the same shard
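The key-to-shard routing can be sketched in Python. Kinesis routes each record by taking the MD5 hash of its partition key, treating it as a 128-bit integer, and picking the shard whose hash-key range contains it; the even split of the key space across shards below is a simplifying assumption for illustration:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Sketch of Kinesis routing: MD5 of the partition key yields a
    128-bit integer; the shard whose hash-key range contains it wins.
    Assumes shards evenly split the 2^128 key space (real streams may
    not, after splits and merges)."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // num_shards
    return min(h // range_size, num_shards - 1)

# Every reading from the same truck lands on the same shard,
# so per-truck ordering is preserved.
assert shard_for_key("truck_17", 4) == shard_for_key("truck_17", 4)
```

Because the mapping is a pure function of the key, "truck_id" as the partition key guarantees per-truck ordering while still spreading 100 trucks across shards.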
10. AWS KINESIS STREAMS RECORDS
• Producer:
  - 1 MB/s or 1000 messages/s at write PER SHARD
  - "ProvisionedThroughputException" otherwise
• Consumer Classic:
  - 2 MB/s at read PER SHARD across all consumers
  - 5 API calls per second PER SHARD across all consumers
  - If 3 different applications are consuming, possibility of throttling
• Data Retention:
  - 24 hours data retention by default
  - Can be extended to 7 days
12. AWS KINESIS PRODUCER SDK
• APIs that are used are PutRecord (one record) and PutRecords (many records)
• PutRecords uses batching and increases throughput => fewer HTTP requests
• ProvisionedThroughputExceeded if we go over the limits
• + AWS Mobile SDK: Android, iOS, etc.
• Use case: low throughput, higher latency, simple API, AWS Lambda
• Managed AWS sources for Kinesis Data Streams:
  - CloudWatch Logs
  - AWS IoT
  - Kinesis Data Analytics
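A minimal producer sketch using the SDK's PutRecords batching. PutRecords accepts at most 500 records per call, so the helper chunks the input first; boto3 is imported lazily inside the sending function so the pure helper stays runnable without AWS credentials (the stream name and record shape are illustrative):

```python
def chunk_records(records, max_batch=500):
    """PutRecords accepts at most 500 records per call,
    so split the input into batches of that size."""
    return [records[i:i + max_batch] for i in range(0, len(records), max_batch)]

def put_records_batched(stream_name, records):
    """Hypothetical producer: each record is a dict with 'Data' (bytes)
    and 'PartitionKey', the shape the PutRecords API expects."""
    import boto3  # lazy import: only needed when actually sending
    client = boto3.client("kinesis")
    for batch in chunk_records(records):
        client.put_records(StreamName=stream_name, Records=batch)
```

One PutRecords call per 500 records instead of one PutRecord call per record is where the "fewer HTTP requests" gain comes from.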
13. AWS KINESIS API – EXCEPTIONS
• ProvisionedThroughputExceeded exceptions
• Happen when sending more data than provisioned (exceeding the MB/s or records/s limit for any shard)
• Make sure you don't have a hot shard (e.g., your partition key is poorly distributed and too much data goes to one shard)
• Solutions:
  - Retries with backoff
  - Increase shards (scaling)
  - Ensure your partition key is a good one
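The "retries with backoff" fix can be sketched as follows; `send_fn` stands in for any put call that raises a throttling exception, and the delay values are illustrative defaults, not prescribed ones:

```python
import random
import time

def send_with_backoff(send_fn, record, max_attempts=5, base_delay=0.1):
    """Retry a throttled put with exponential backoff plus jitter.
    send_fn is any callable that raises an exception whose message
    contains 'ProvisionedThroughputExceeded' when throttled."""
    for attempt in range(max_attempts):
        try:
            return send_fn(record)
        except Exception as exc:
            if "ProvisionedThroughputExceeded" not in str(exc):
                raise  # not a throttle: don't retry
            if attempt == max_attempts - 1:
                raise  # retries exhausted
            # 0.1s, 0.2s, 0.4s, ... with jitter so many producers
            # don't all retry at the same instant
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

The jitter matters: without it, every throttled producer retries on the same schedule and hammers the hot shard again in lockstep.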
14. KINESIS PRODUCER LIBRARY
• Easy-to-use and highly configurable C++ / Java library
• Used for building high-performance, long-running producers
• Automated and configurable retry mechanism
• Synchronous or asynchronous API (better performance for async)
• Submits metrics to CloudWatch for monitoring
• Batching (both turned on by default) – increases throughput, decreases cost:
  - Collect: records are written to multiple shards in the same PutRecords API call
  - Aggregate: increased latency
    - Capability to store multiple records in one record (go over the 1000 records per second limit)
    - Increases payload size and improves throughput (maximize the 1 MB/s limit)
• Compression must be implemented by the user
• KPL records must be decoded with the KCL or a special helper library
15. KINESIS PRODUCER LIBRARY (KPL) BATCHING
• We can influence the batching efficiency by introducing some delay with RecordMaxBufferedTime (default 100 ms)
16. KINESIS PRODUCER LIBRARY – WHEN NOT TO USE
• The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable)
• Larger values of RecordMaxBufferedTime result in higher packing efficiency and better performance
• Applications that cannot tolerate this additional delay may need to use the AWS SDK directly
17. KINESIS AGENT
• Monitors log files and sends them to Kinesis Data Streams
• Java-based agent, built on top of the KPL
• Installed in Linux-based server environments
• Features:
  - Write from multiple directories and write to multiple streams
  - Routing feature based on directory / log file
  - Pre-process data before sending to streams (single line, CSV to JSON, log to JSON, ...)
  - The agent handles file rotation, checkpointing, and retry upon failures
  - Emits metrics to CloudWatch for monitoring
19. KINESIS CONSUMER SDK – GETRECORDS
• Classic Kinesis – records are polled by consumers from a shard
• Each shard has 2 MB total aggregate throughput
• GetRecords returns up to 10 MB of data (then throttles for 5 seconds) or up to 10000 records
• Maximum of 5 GetRecords API calls per shard per second = 200 ms latency
• If 5 consumer applications consume from the same shard, every consumer can poll once a second and receive less than 400 KB/s
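The classic polling loop looks roughly like this; `client` is any object exposing `get_shard_iterator` / `get_records` (a boto3 Kinesis client in practice, a stub in tests), and the 0.2 s pause is one way to stay under the 5-calls-per-shard-per-second limit for a single consumer:

```python
import time

def read_shard(client, stream_name, shard_id, poll_interval=0.2):
    """Generator over one shard's records via the classic polling API.
    Sketch only: no checkpointing or error handling, which is what
    the KCL adds on top of this loop."""
    it = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from oldest retained data
    )["ShardIterator"]
    while it:
        resp = client.get_records(ShardIterator=it, Limit=10000)
        yield from resp["Records"]
        it = resp.get("NextShardIterator")  # None when the shard is closed
        time.sleep(poll_interval)  # ~5 polls/s = the per-shard API limit
```

With 5 such consumers sharing a shard, each one would have to stretch `poll_interval` to ~1 s, which is where the "poll once a second, under 400 KB/s each" figure comes from.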
20. KINESIS CLIENT LIBRARY (KCL)
• Java-first library, but exists for other languages too (Golang, Python, Ruby, Node, .NET, ...)
• Reads records from Kinesis produced with the KPL (de-aggregation)
• Shares multiple shards among multiple consumers in one "group", shard discovery
• Checkpointing feature to resume progress
• Leverages DynamoDB for coordination and checkpointing (one row per shard)
  - Make sure you provision enough WCU / RCU
  - Or use On-Demand for DynamoDB
  - Otherwise DynamoDB may slow down the KCL
• Record processors will process the data
• ExpiredIteratorException => increase WCU
21. KINESIS CONNECTOR LIBRARY
• Older Java library (2016), leverages the KCL library
• Writes data to:
  - Amazon S3
  - DynamoDB
  - Redshift
  - ElasticSearch
• Kinesis Firehose replaces the Connector Library for a few of these targets, Lambda for the others
22. AWS LAMBDA SOURCING FROM KINESIS
• AWS Lambda can source records from Kinesis Data Streams
• The Lambda consumer has a library to de-aggregate records from the KPL
• Lambda can be used to run lightweight ETL to:
  - Amazon S3
  - DynamoDB
  - Redshift
  - ElasticSearch
  - Anywhere you want
• Lambda can be used to trigger notifications / send emails in real time
• Lambda has a configurable batch size (more in the Lambda section)
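A minimal handler sketch for a Kinesis-triggered Lambda: the event delivers each record's payload base64-encoded under `record["kinesis"]["data"]`. Assuming JSON payloads here; the destination write is left as a comment:

```python
import base64
import json

def handler(event, context):
    """Sketch of a Lambda consumer for a Kinesis event source mapping.
    Decodes each record's base64 payload (assumed to be JSON)."""
    out = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        out.append(json.loads(payload))
    # ...write `out` to S3 / DynamoDB / Redshift / ElasticSearch here...
    return {"processed": len(out)}
```

The batch size configured on the event source mapping controls how many records arrive in one `event["Records"]` list per invocation.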
23. KINESIS ENHANCED FAN OUT
• New game-changing feature from August 2018
• Works with KCL 2.0 and AWS Lambda (Nov 2018)
• Each consumer gets 2 MB/s of provisioned throughput per shard
• That means 20 consumers will get 40 MB/s per shard aggregated
• No more 2 MB/s limit!
• Enhanced Fan Out: Kinesis pushes data to consumers over HTTP/2
• Reduced latency (~70 ms)
24. ENHANCED FAN-OUT VS STANDARD CONSUMERS
• Standard consumers:
  - Low number of consuming applications (1, 2, 3, ...)
  - Can tolerate ~200 ms latency
  - Minimize cost
• Enhanced Fan-Out consumers:
  - Multiple consumer applications for the same stream
  - Low latency requirements (~70 ms)
  - Higher costs (see Kinesis pricing page)
  - Default limit of 5 consumers using enhanced fan-out per data stream
25. KINESIS OPERATIONS – ADDING SHARDS
• Also called "Shard Splitting"
• Can be used to increase the stream capacity (1 MB/s data in per shard)
• Can be used to divide a "hot shard"
• The old shard is closed and will be deleted once its data expires
26. KINESIS OPERATIONS – MERGING SHARDS
• Decreases the stream capacity and saves costs
• Can be used to group two shards with low traffic
• Old shards are closed and deleted based on data expiration
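Splitting a hot shard comes down to picking a hash key inside the parent's range and calling the SplitShard API; a common choice is the midpoint, which halves the key space. The wrapper below is a sketch (the `shard` argument is assumed to be a shard description as returned by DescribeStream), with boto3 imported lazily so the midpoint helper runs offline:

```python
def midpoint(starting_hash_key: int, ending_hash_key: int) -> int:
    """SplitShard needs a NewStartingHashKey strictly inside the
    parent shard's hash-key range; the midpoint splits it evenly."""
    return (starting_hash_key + ending_hash_key) // 2

def split_hot_shard(stream_name, shard):
    """Hypothetical wrapper around the SplitShard API."""
    import boto3  # lazy import: only needed when actually calling AWS
    client = boto3.client("kinesis")
    rng = shard["HashKeyRange"]
    mid = midpoint(int(rng["StartingHashKey"]), int(rng["EndingHashKey"]))
    client.split_shard(
        StreamName=stream_name,
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str(mid),
    )
```

Merging is the inverse call: `merge_shards(StreamName=..., ShardToMerge=..., AdjacentShardToMerge=...)` on two shards with adjacent hash-key ranges.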
27. OUT-OF-ORDER RECORDS AFTER RESHARDING
• After a reshard, you can read from the child shards
• However, data you haven't read yet could still be in the parent
• If you start reading the child before completely reading the parent, you could read data for a particular hash key out of order
• After a reshard, read entirely from the parent until you have no new records
• Note: the Kinesis Client Library (KCL) has this logic already built in, even after resharding operations
28. KINESIS OPERATIONS – AUTO SCALING
• Auto Scaling is not a native feature of Kinesis
• The API call to change the number of shards is UpdateShardCount
• We can implement Auto Scaling with AWS Lambda
• See: https://aws.amazon.com/blogs/big-data/scaling-amazon-kinesis-data-streams-with-aws-application-auto-scaling/
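A scaling sketch around UpdateShardCount. Since the API rejects targets above double or below half the current count (see the limitations on the next slide), clamping the desired value first avoids a failed call; boto3 is imported lazily so the clamp helper runs offline, and the half-rounding is an assumption for illustration:

```python
def clamp_target(current: int, desired: int) -> int:
    """UpdateShardCount rejects targets above double or below half the
    current open shard count, so clamp the request into that window.
    (Rounding half up for odd counts is an assumption here.)"""
    return max(min(desired, current * 2), (current + 1) // 2)

def scale_stream(stream_name, desired):
    """Hypothetical scaling step, e.g. called from a Lambda function."""
    import boto3  # lazy import: only needed when actually calling AWS
    client = boto3.client("kinesis")
    summary = client.describe_stream_summary(StreamName=stream_name)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]
    client.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=clamp_target(current, desired),
        ScalingType="UNIFORM_SCALING",
    )
```

Going from 10 shards to 100 therefore takes several successive doublings (10 → 20 → 40 → 80 → 100), each a separate resharding operation.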
29. KINESIS SCALING LIMITATIONS
• Resharding cannot be done in parallel. Plan capacity in advance
• You can only perform one resharding operation at a time, and each takes a few seconds
• For 1000 shards, it takes 30K seconds (8.3 hours) to double the shards to 2000
• You can't do the following:
  - Scale more than twice for each rolling 24-hour period for each stream
  - Scale up to more than double your current shard count for a stream
  - Scale down below half your current shard count for a stream
  - Scale up to more than 10000 shards in a stream
  - Scale a stream with more than 10000 shards down, unless the result is fewer than 10000 shards
  - Scale up to more than the shard limit for your account
• If you need to scale more than once a day, you can ask AWS to increase this limit
30. KINESIS SECURITY
• Control access / authorization using IAM policies
• Encryption in flight using HTTPS endpoints
• Encryption at rest using KMS
• Client-side encryption must be manually implemented (harder)
• VPC Endpoints available to access Kinesis from within a VPC
31. KINESIS DATA STREAMS – HANDLING DUPLICATES FOR PRODUCERS
• Producer retries can create duplicates due to network timeouts
• Although the two records have identical data, they have unique sequence numbers
• Fix: embed a unique record ID in the data to de-duplicate on the consumer side
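The embed-an-ID fix can be sketched in a few lines: the producer stamps each payload with a UUID before sending, and the consumer drops payloads whose ID it has already seen (sequence numbers can't help here, since the retry gets a fresh one). The JSON-with-`id` payload shape is an illustrative convention, not a Kinesis requirement:

```python
import json
import uuid

def make_record(data: dict) -> bytes:
    """Producer side: embed a unique ID so that retries (which get a
    new sequence number) can still be de-duplicated downstream."""
    return json.dumps({"id": str(uuid.uuid4()), **data}).encode("utf-8")

def dedupe(payloads):
    """Consumer side: drop payloads whose embedded ID was seen before.
    A real consumer would persist `seen` across restarts."""
    seen, out = set(), []
    for raw in payloads:
        rec = json.loads(raw)
        if rec["id"] not in seen:
            seen.add(rec["id"])
            out.append(rec)
    return out
```
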
32. KINESIS DATA STREAMS – HANDLING DUPLICATES FOR CONSUMERS
• Consumer retries can make your application read the same data twice
• Consumer retries happen when record processors restart:
  - A worker terminates unexpectedly
  - Worker instances are added or removed
  - Shards are merged or split
  - The application is deployed
• Fixes:
  - Make your consumer application idempotent
  - If the final destination can handle duplicates, it's recommended to do it there
• More info: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html
33. AWS KINESIS DATA FIREHOSE
• Fully managed service, no administration
• Near real time (60 seconds latency minimum for non-full batches)
• Loads data into Redshift / Amazon S3 / ElasticSearch / Splunk
• Automatic scaling
• Supports many data formats
• Data conversion from JSON to Parquet / ORC (only for S3)
• Data transformation through AWS Lambda (ex: CSV => JSON)
• Supports compression when the target is Amazon S3 (GZIP, ZIP, and SNAPPY)
• Only GZIP if the data is further loaded into Redshift
• Spark / KCL do not read from KDF
• Pay for the amount of data going through Firehose
36. FIREHOSE BUFFER SIZING
• Firehose accumulates records in a buffer
• The buffer is flushed based on time and size rules
• Buffer Size (ex: 32 MB): if that buffer size is reached, it's flushed
• Buffer Time (ex: 2 minutes): if that time is reached, it's flushed
• Firehose can automatically increase the buffer size to increase throughput
• High throughput => the Buffer Size limit will be hit
• Low throughput => the Buffer Time limit will be hit
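The flush rule above is "whichever limit is hit first", which can be modeled in a few lines. This is a toy model of the behavior, not Firehose's implementation; the 32 MB / 120 s defaults mirror the slide's example values, and the injectable clock exists only so the time rule can be exercised deterministically:

```python
import time

class FirehoseBufferSketch:
    """Toy model of Firehose buffering: flush when either the size
    hint or the buffer interval is reached, whichever comes first."""

    def __init__(self, max_bytes=32 * 1024**2, max_seconds=120,
                 now=time.monotonic):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.now = now                # injectable clock for testing
        self.size = 0
        self.started = now()          # when the current buffer opened

    def add(self, record: bytes) -> bool:
        """Buffer one record; return True when a flush would fire."""
        self.size += len(record)
        return (self.size >= self.max_bytes
                or self.now() - self.started >= self.max_seconds)
```

Under high throughput the size branch fires long before 2 minutes pass; under a trickle of data the time branch is what finally flushes, which is exactly the trade-off the two bullets describe.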
37. AWS KINESIS DATA STREAMS VS FIREHOSE
Streams
• You write custom code (producer / consumer)
• Real time (~200 ms latency for classic)
• Must manage scaling (shard splitting / merging)
• Data storage for 1 to 7 days, replay capability, multiple consumers
• Use with Lambda to insert data in real time into ElasticSearch (for example)
Firehose
• Fully managed, sends to S3, Splunk, Redshift, ElasticSearch
• Serverless data transformations with Lambda
• Near real time (lowest buffer time is 1 minute)
• Automated scaling
• No data storage
38. CLOUDWATCH LOGS SUBSCRIPTION FILTERS
• You can stream CloudWatch Logs into:
  - Kinesis Data Streams
  - Kinesis Data Firehose
  - AWS Lambda
• Using CloudWatch Logs Subscription Filters
• You can enable them using the AWS CLI
42. USE CASE – EXERCISE 1 – DATA COLLECTION
• Create a Kinesis Firehose delivery stream
• Generate an OrderHistory CSV file using a LogGenerator Python script
• Publish the data to an S3 bucket from Firehose using the Kinesis Agent
• Create a Kinesis Data Stream
• Publish data from the Kinesis Agent to the Kinesis Data Stream