2. DATA COLLECTION INTRODUCTION
• Real Time - Immediate actions
  - Kinesis Data Streams (KDS)
  - Simple Queue Service (SQS)
  - Internet of Things (IoT)
• Near-real time - Reactive actions
  - Kinesis Data Firehose (KDF)
  - Database Migration Service (DMS)
• Batch - Historical Analysis
  - Snowball
  - Data Pipeline
4. AWS KINESIS OVERVIEW
• Kinesis is a managed alternative to Apache Kafka
• Great for application logs, metrics, IoT, clickstreams
• Great for "real-time" big data
• Great for stream processing frameworks (Spark, NiFi, etc.)
• Data is automatically replicated across 3 AZs
• Kinesis Components
  - Kinesis Streams: low-latency streaming ingest at scale
  - Kinesis Analytics: perform real-time analytics on streams using SQL
  - Kinesis Firehose: load streams into S3, Redshift, ElasticSearch, ...
6. AWS KINESIS OVERVIEW
• Streams are divided into ordered Shards / Partitions
• Data retention is 1 day by default, can go up to 7 days
• Ability to reprocess / replay data
• Multiple applications can consume the same stream
• Real-time processing with scalable throughput
• Once data is inserted into Kinesis, it can't be deleted (immutability)
7. AWS KINESIS STREAMS SHARDS
• One stream is made of many different shards
• Billing is per provisioned shard; you can have as many shards as you want
• Batching available, or per-message calls
• The number of shards can evolve over time (reshard / merge)
• Records are ordered per shard
8. AWS KINESIS STREAMS RECORDS
• Data Blob: the data being sent, serialized as bytes. Up to 1 MB. Can represent anything
• Record Key: sent alongside a record, helps to group records in shards. Same key = same shard. Use a highly distributed key to avoid the "hot partition" problem
• Sequence Number: unique identifier for each record put in a shard. Added by Kinesis after ingestion
9. DATA ORDERING FOR KINESIS
• Imagine you have 100 trucks (truck_1, truck_2, ... truck_100) on the road sending their GPS positions regularly into AWS.
• You want to consume the data in order for each truck, so that you can track their movement accurately.
• How should you send that data into Kinesis?
• Answer: send using a "Partition Key" value of the "truck_id"
• The same key will always go to the same shard
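The key-to-shard routing can be sketched in Python. Kinesis routes each record by taking the MD5 hash of its partition key, treating it as a 128-bit integer, and picking the shard whose hash-key range contains it; the even split of the key space across shards below is a simplifying assumption for illustration:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Sketch of Kinesis routing: MD5 of the partition key yields a
    128-bit integer; the shard whose hash-key range contains it wins.
    Assumes shards evenly split the 2^128 key space (real streams may
    not, after splits and merges)."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // num_shards
    return min(h // range_size, num_shards - 1)

# Every reading from the same truck lands on the same shard,
# so per-truck ordering is preserved.
assert shard_for_key("truck_17", 4) == shard_for_key("truck_17", 4)
```

Because the mapping is a pure function of the key, "truck_id" as the partition key guarantees per-truck ordering while still spreading 100 trucks across shards.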
10. AWS KINESIS STREAMS RECORDS
• Producer:
  - 1 MB/s or 1000 messages/s at write PER SHARD
  - "ProvisionedThroughputException" otherwise
• Consumer Classic:
  - 2 MB/s at read PER SHARD across all consumers
  - 5 API calls per second PER SHARD across all consumers
  - If 3 different applications are consuming, possibility of throttling
• Data Retention:
  - 24 hours data retention by default
  - Can be extended to 7 days
12. AWS KINESIS PRODUCER SDK
• APIs that are used are PutRecord (one record) and PutRecords (many records)
• PutRecords uses batching and increases throughput => fewer HTTP requests
• ProvisionedThroughputExceeded if we go over the limits
• + AWS Mobile SDK: Android, iOS, etc.
• Use case: low throughput, higher latency, simple API, AWS Lambda
• Managed AWS sources for Kinesis Data Streams:
  - CloudWatch Logs
  - AWS IoT
  - Kinesis Data Analytics
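A minimal producer sketch using the SDK's PutRecords batching. PutRecords accepts at most 500 records per call, so the helper chunks the input first; boto3 is imported lazily inside the sending function so the pure helper stays runnable without AWS credentials (the stream name and record shape are illustrative):

```python
def chunk_records(records, max_batch=500):
    """PutRecords accepts at most 500 records per call,
    so split the input into batches of that size."""
    return [records[i:i + max_batch] for i in range(0, len(records), max_batch)]

def put_records_batched(stream_name, records):
    """Hypothetical producer: each record is a dict with 'Data' (bytes)
    and 'PartitionKey', the shape the PutRecords API expects."""
    import boto3  # lazy import: only needed when actually sending
    client = boto3.client("kinesis")
    for batch in chunk_records(records):
        client.put_records(StreamName=stream_name, Records=batch)
```

One PutRecords call per 500 records instead of one PutRecord call per record is where the "fewer HTTP requests" gain comes from.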
13. AWS KINESIS API – EXCEPTIONS
• ProvisionedThroughputExceeded exceptions
• Happen when sending more data than provisioned (exceeding the MB/s or records/s limit for any shard)
• Make sure you don't have a hot shard (e.g., your partition key is poorly distributed and too much data goes to one shard)
• Solutions:
  - Retries with backoff
  - Increase shards (scaling)
  - Ensure your partition key is a good one
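The "retries with backoff" fix can be sketched as follows; `send_fn` stands in for any put call that raises a throttling exception, and the delay values are illustrative defaults, not prescribed ones:

```python
import random
import time

def send_with_backoff(send_fn, record, max_attempts=5, base_delay=0.1):
    """Retry a throttled put with exponential backoff plus jitter.
    send_fn is any callable that raises an exception whose message
    contains 'ProvisionedThroughputExceeded' when throttled."""
    for attempt in range(max_attempts):
        try:
            return send_fn(record)
        except Exception as exc:
            if "ProvisionedThroughputExceeded" not in str(exc):
                raise  # not a throttle: don't retry
            if attempt == max_attempts - 1:
                raise  # retries exhausted
            # 0.1s, 0.2s, 0.4s, ... with jitter so many producers
            # don't all retry at the same instant
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

The jitter matters: without it, every throttled producer retries on the same schedule and hammers the hot shard again in lockstep.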
14. KINESIS PRODUCER LIBRARY
• Easy-to-use and highly configurable C++ / Java library
• Used for building high-performance, long-running producers
• Automated and configurable retry mechanism
• Synchronous or asynchronous API (better performance for async)
• Submits metrics to CloudWatch for monitoring
• Batching (both turned on by default) – increases throughput, decreases cost:
  - Collect: records are written to multiple shards in the same PutRecords API call
  - Aggregate: increased latency
    - Capability to store multiple records in one record (go over the 1000 records per second limit)
    - Increases payload size and improves throughput (maximize the 1 MB/s limit)
• Compression must be implemented by the user
• KPL records must be decoded with the KCL or a special helper library
15. KINESIS PRODUCER LIBRARY (KPL) BATCHING
• We can influence the batching efficiency by introducing some delay with RecordMaxBufferedTime (default 100 ms)
16. KINESIS PRODUCER LIBRARY – WHEN NOT TO USE
• The KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable)
• Larger values of RecordMaxBufferedTime result in higher packing efficiency and better performance
• Applications that cannot tolerate this additional delay may need to use the AWS SDK directly
17. KINESIS AGENT
• Monitors log files and sends them to Kinesis Data Streams
• Java-based agent, built on top of the KPL
• Installed in Linux-based server environments
• Features:
  - Write from multiple directories and write to multiple streams
  - Routing feature based on directory / log file
  - Pre-process data before sending to streams (single line, CSV to JSON, log to JSON, ...)
  - The agent handles file rotation, checkpointing, and retry upon failures
  - Emits metrics to CloudWatch for monitoring
19. KINESIS CONSUMER SDK – GETRECORDS
• Classic Kinesis – records are polled by consumers from a shard
• Each shard has 2 MB total aggregate throughput
• GetRecords returns up to 10 MB of data (then throttles for 5 seconds) or up to 10000 records
• Maximum of 5 GetRecords API calls per shard per second = 200 ms latency
• If 5 consumer applications consume from the same shard, every consumer can poll once a second and receive less than 400 KB/s
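The classic polling loop looks roughly like this; `client` is any object exposing `get_shard_iterator` / `get_records` (a boto3 Kinesis client in practice, a stub in tests), and the 0.2 s pause is one way to stay under the 5-calls-per-shard-per-second limit for a single consumer:

```python
import time

def read_shard(client, stream_name, shard_id, poll_interval=0.2):
    """Generator over one shard's records via the classic polling API.
    Sketch only: no checkpointing or error handling, which is what
    the KCL adds on top of this loop."""
    it = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from oldest retained data
    )["ShardIterator"]
    while it:
        resp = client.get_records(ShardIterator=it, Limit=10000)
        yield from resp["Records"]
        it = resp.get("NextShardIterator")  # None when the shard is closed
        time.sleep(poll_interval)  # ~5 polls/s = the per-shard API limit
```

With 5 such consumers sharing a shard, each one would have to stretch `poll_interval` to ~1 s, which is where the "poll once a second, under 400 KB/s each" figure comes from.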
20. KINESIS CLIENT LIBRARY (KCL)
• Java-first library, but exists for other languages too (Golang, Python, Ruby, Node, .NET, ...)
• Reads records from Kinesis produced with the KPL (de-aggregation)
• Shares multiple shards among multiple consumers in one "group", shard discovery
• Checkpointing feature to resume progress
• Leverages DynamoDB for coordination and checkpointing (one row per shard)
  - Make sure you provision enough WCU / RCU
  - Or use On-Demand for DynamoDB
  - Otherwise DynamoDB may slow down the KCL
• Record processors will process the data
• ExpiredIteratorException => increase WCU
21. KINESIS CONNECTOR LIBRARY
• Older Java library (2016), leverages the KCL library
• Writes data to:
  - Amazon S3
  - DynamoDB
  - Redshift
  - ElasticSearch
• Kinesis Firehose replaces the Connector Library for a few of these targets, Lambda for the others
22. AWS LAMBDA SOURCING FROM KINESIS
• AWS Lambda can source records from Kinesis Data Streams
• The Lambda consumer has a library to de-aggregate records from the KPL
• Lambda can be used to run lightweight ETL to:
  - Amazon S3
  - DynamoDB
  - Redshift
  - ElasticSearch
  - Anywhere you want
• Lambda can be used to trigger notifications / send emails in real time
• Lambda has a configurable batch size (more in the Lambda section)
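A minimal handler sketch for a Kinesis-triggered Lambda: the event delivers each record's payload base64-encoded under `record["kinesis"]["data"]`. Assuming JSON payloads here; the destination write is left as a comment:

```python
import base64
import json

def handler(event, context):
    """Sketch of a Lambda consumer for a Kinesis event source mapping.
    Decodes each record's base64 payload (assumed to be JSON)."""
    out = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        out.append(json.loads(payload))
    # ...write `out` to S3 / DynamoDB / Redshift / ElasticSearch here...
    return {"processed": len(out)}
```

The batch size configured on the event source mapping controls how many records arrive in one `event["Records"]` list per invocation.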
23. KINESIS ENHANCED FAN OUT
• New game-changing feature from August 2018
• Works with KCL 2.0 and AWS Lambda (Nov 2018)
• Each consumer gets 2 MB/s of provisioned throughput per shard
• That means 20 consumers will get 40 MB/s per shard aggregated
• No more 2 MB/s limit!
• Enhanced Fan Out: Kinesis pushes data to consumers over HTTP/2
• Reduced latency (~70 ms)
24. ENHANCED FAN-OUT VS STANDARD CONSUMERS
• Standard consumers:
  - Low number of consuming applications (1, 2, 3, ...)
  - Can tolerate ~200 ms latency
  - Minimize cost
• Enhanced Fan-Out consumers:
  - Multiple consumer applications for the same stream
  - Low latency requirements (~70 ms)
  - Higher costs (see Kinesis pricing page)
  - Default limit of 5 consumers using enhanced fan-out per data stream
25. KINESIS OPERATIONS – ADDING SHARDS
• Also called "Shard Splitting"
• Can be used to increase the stream capacity (1 MB/s data in per shard)
• Can be used to divide a "hot shard"
• The old shard is closed and will be deleted once its data expires
26. KINESIS OPERATIONS – MERGING SHARDS
• Decreases the stream capacity and saves costs
• Can be used to group two shards with low traffic
• Old shards are closed and deleted based on data expiration
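Splitting a hot shard comes down to picking a hash key inside the parent's range and calling the SplitShard API; a common choice is the midpoint, which halves the key space. The wrapper below is a sketch (the `shard` argument is assumed to be a shard description as returned by DescribeStream), with boto3 imported lazily so the midpoint helper runs offline:

```python
def midpoint(starting_hash_key: int, ending_hash_key: int) -> int:
    """SplitShard needs a NewStartingHashKey strictly inside the
    parent shard's hash-key range; the midpoint splits it evenly."""
    return (starting_hash_key + ending_hash_key) // 2

def split_hot_shard(stream_name, shard):
    """Hypothetical wrapper around the SplitShard API."""
    import boto3  # lazy import: only needed when actually calling AWS
    client = boto3.client("kinesis")
    rng = shard["HashKeyRange"]
    mid = midpoint(int(rng["StartingHashKey"]), int(rng["EndingHashKey"]))
    client.split_shard(
        StreamName=stream_name,
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str(mid),
    )
```

Merging is the inverse call: `merge_shards(StreamName=..., ShardToMerge=..., AdjacentShardToMerge=...)` on two shards with adjacent hash-key ranges.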
27. OUT-OF-ORDER RECORDS AFTER RESHARDING
• After a reshard, you can read from the child shards
• However, data you haven't read yet could still be in the parent
• If you start reading the child before completely reading the parent, you could read data for a particular hash key out of order
• After a reshard, read entirely from the parent until you have no new records
• Note: the Kinesis Client Library (KCL) has this logic already built in, even after resharding operations
28. KINESIS OPERATIONS – AUTO SCALING
• Auto Scaling is not a native feature of Kinesis
• The API call to change the number of shards is UpdateShardCount
• We can implement Auto Scaling with AWS Lambda
• See: https://aws.amazon.com/blogs/big-data/scaling-amazon-kinesis-data-streams-with-aws-application-auto-scaling/
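A scaling sketch around UpdateShardCount. Since the API rejects targets above double or below half the current count (see the limitations on the next slide), clamping the desired value first avoids a failed call; boto3 is imported lazily so the clamp helper runs offline, and the half-rounding is an assumption for illustration:

```python
def clamp_target(current: int, desired: int) -> int:
    """UpdateShardCount rejects targets above double or below half the
    current open shard count, so clamp the request into that window.
    (Rounding half up for odd counts is an assumption here.)"""
    return max(min(desired, current * 2), (current + 1) // 2)

def scale_stream(stream_name, desired):
    """Hypothetical scaling step, e.g. called from a Lambda function."""
    import boto3  # lazy import: only needed when actually calling AWS
    client = boto3.client("kinesis")
    summary = client.describe_stream_summary(StreamName=stream_name)
    current = summary["StreamDescriptionSummary"]["OpenShardCount"]
    client.update_shard_count(
        StreamName=stream_name,
        TargetShardCount=clamp_target(current, desired),
        ScalingType="UNIFORM_SCALING",
    )
```

Going from 10 shards to 100 therefore takes several successive doublings (10 → 20 → 40 → 80 → 100), each a separate resharding operation.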
29. KINESIS SCALING LIMITATIONS
• Resharding cannot be done in parallel. Plan capacity in advance
• You can only perform one resharding operation at a time, and each takes a few seconds
• For 1000 shards, it takes 30K seconds (8.3 hours) to double the shards to 2000
• You can't do the following:
  - Scale more than twice for each rolling 24-hour period for each stream
  - Scale up to more than double your current shard count for a stream
  - Scale down below half your current shard count for a stream
  - Scale up to more than 10000 shards in a stream
  - Scale a stream with more than 10000 shards down, unless the result is fewer than 10000 shards
  - Scale up to more than the shard limit for your account
• If you need to scale more than once a day, you can ask AWS to increase this limit
30. KINESIS SECURITY
• Control access / authorization using IAM policies
• Encryption in flight using HTTPS endpoints
• Encryption at rest using KMS
• Client-side encryption must be manually implemented (harder)
• VPC Endpoints available to access Kinesis from within a VPC
31. KINESIS DATA STREAMS – HANDLING DUPLICATES FOR PRODUCERS
• Producer retries can create duplicates due to network timeouts
• Although the two records have identical data, they have unique sequence numbers
• Fix: embed a unique record ID in the data to de-duplicate on the consumer side
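The embed-an-ID fix can be sketched in a few lines: the producer stamps each payload with a UUID before sending, and the consumer drops payloads whose ID it has already seen (sequence numbers can't help here, since the retry gets a fresh one). The JSON-with-`id` payload shape is an illustrative convention, not a Kinesis requirement:

```python
import json
import uuid

def make_record(data: dict) -> bytes:
    """Producer side: embed a unique ID so that retries (which get a
    new sequence number) can still be de-duplicated downstream."""
    return json.dumps({"id": str(uuid.uuid4()), **data}).encode("utf-8")

def dedupe(payloads):
    """Consumer side: drop payloads whose embedded ID was seen before.
    A real consumer would persist `seen` across restarts."""
    seen, out = set(), []
    for raw in payloads:
        rec = json.loads(raw)
        if rec["id"] not in seen:
            seen.add(rec["id"])
            out.append(rec)
    return out
```
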
32. KINESIS DATA STREAMS – HANDLING DUPLICATES FOR CONSUMERS
• Consumer retries can make your application read the same data twice
• Consumer retries happen when record processors restart:
  - A worker terminates unexpectedly
  - Worker instances are added or removed
  - Shards are merged or split
  - The application is deployed
• Fixes:
  - Make your consumer application idempotent
  - If the final destination can handle duplicates, it's recommended to do it there
• More info: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html
33. AWS KINESIS DATA FIREHOSE
• Fully managed service, no administration
• Near real time (60 seconds latency minimum for non-full batches)
• Loads data into Redshift / Amazon S3 / ElasticSearch / Splunk
• Automatic scaling
• Supports many data formats
• Data conversion from JSON to Parquet / ORC (only for S3)
• Data transformation through AWS Lambda (ex: CSV => JSON)
• Supports compression when the target is Amazon S3 (GZIP, ZIP, and SNAPPY)
• Only GZIP if the data is further loaded into Redshift
• Spark / KCL do not read from KDF
• Pay for the amount of data going through Firehose
36. FIREHOSE BUFFER SIZING
• Firehose accumulates records in a buffer
• The buffer is flushed based on time and size rules
• Buffer Size (ex: 32 MB): if that buffer size is reached, it's flushed
• Buffer Time (ex: 2 minutes): if that time is reached, it's flushed
• Firehose can automatically increase the buffer size to increase throughput
• High throughput => the Buffer Size limit will be hit
• Low throughput => the Buffer Time limit will be hit
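The flush rule above is "whichever limit is hit first", which can be modeled in a few lines. This is a toy model of the behavior, not Firehose's implementation; the 32 MB / 120 s defaults mirror the slide's example values, and the injectable clock exists only so the time rule can be exercised deterministically:

```python
import time

class FirehoseBufferSketch:
    """Toy model of Firehose buffering: flush when either the size
    hint or the buffer interval is reached, whichever comes first."""

    def __init__(self, max_bytes=32 * 1024**2, max_seconds=120,
                 now=time.monotonic):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.now = now                # injectable clock for testing
        self.size = 0
        self.started = now()          # when the current buffer opened

    def add(self, record: bytes) -> bool:
        """Buffer one record; return True when a flush would fire."""
        self.size += len(record)
        return (self.size >= self.max_bytes
                or self.now() - self.started >= self.max_seconds)
```

Under high throughput the size branch fires long before 2 minutes pass; under a trickle of data the time branch is what finally flushes, which is exactly the trade-off the two bullets describe.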
37. AWS KINESIS DATA STREAMS VS FIREHOSE
Streams
• You write custom code (producer / consumer)
• Real time (~200 ms latency for classic)
• Must manage scaling (shard splitting / merging)
• Data storage for 1 to 7 days, replay capability, multiple consumers
• Use with Lambda to insert data in real time into ElasticSearch (for example)
Firehose
• Fully managed, sends to S3, Splunk, Redshift, ElasticSearch
• Serverless data transformations with Lambda
• Near real time (lowest buffer time is 1 minute)
• Automated scaling
• No data storage
38. CLOUDWATCH LOGS SUBSCRIPTION FILTERS
• You can stream CloudWatch Logs into:
  - Kinesis Data Streams
  - Kinesis Data Firehose
  - AWS Lambda
• Using CloudWatch Logs Subscription Filters
• You can enable them using the AWS CLI
42. USE CASE – EXERCISE 1 – DATA COLLECTION
• Create a Kinesis Firehose delivery stream
• Generate an OrderHistory CSV file using a LogGenerator Python script
• Publish the data to an S3 bucket from Firehose using the Kinesis Agent
• Create a Kinesis Data Stream
• Publish data from the Kinesis Agent to the Kinesis Data Stream