BIG DATA SPECIALTY
CERTIFICATION
AWS CERTIFIED DATA ANALYTICS SPECIALTY COURSE
DATA COLLECTION
DATA COLLECTION INTRODUCTION
ØReal Time - Immediate actions
Ø Kinesis Data Streams (KDS)
Ø Simple Queue Service (SQS)
Ø Internet of Things (IoT)
ØNear-real time - Reactive actions
Ø Kinesis Data Firehose (KDF)
Ø Database Migration Service (DMS)
ØBatch - Historical Analysis
Ø Snowball
Ø Data Pipeline
AWS KINESIS
AWS KINESIS OVERVIEW
ØKinesis is a managed alternative to Apache Kafka
ØGreat for application logs, metrics, IoT, clickstreams
ØGreat for “real-time” big data
ØGreat for stream processing frameworks (Spark, NiFi, etc...)
ØData is automatically replicated across 3 AZs
ØKinesis Components
Ø Kinesis Streams: low latency streaming ingest at scale
Ø Kinesis Analytics: perform real-time analytics on streams using SQL
Ø Kinesis Firehose: load streams into S3, Redshift, ElasticSearch …
AWS KINESIS EXAMPLE
AWS KINESIS OVERVIEW
Streams are divided into ordered Shards / Partitions
ØData retention is 1 day by default, can go up to 7 days
ØAbility to reprocess / replay data
ØMultiple applications can consume the same stream
ØReal-time processing with scale of throughput
ØOnce data is inserted in Kinesis, it can’t be deleted (immutability)
AWS KINESIS STREAMS SHARDS
ØOne stream is made of many different shards
ØBilling is per shard provisioned, can have as many shards as you want
ØBatching available, or per-message calls
ØThe number of shards can evolve over time (reshard / merge)
ØRecords are ordered per shard
AWS KINESIS STREAMS – RECORDS
ØData Blob: the data being sent, serialized as bytes. Up to 1 MB. Can represent anything
ØRecord Key (Partition Key): sent alongside a record; helps group records into shards. Same key = same shard. Use a highly distributed key to avoid the “hot partition” problem
ØSequence Number: unique identifier for each record put in a shard, added by Kinesis after ingestion
DATA ORDERING FOR KINESIS
ØImagine you have 100 trucks (truck_1, truck_2, ... truck_100) on the road sending their GPS positions regularly into AWS
ØYou want to consume the data in order for each truck, so that you can track their movement accurately
ØHow should you send that data into Kinesis?
ØAnswer: send using a “Partition Key” value of the “truck_id” (see the sketch below)
ØThe same key will always go to the same shard
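To make the idea concrete, here is a minimal producer sketch using the AWS SDK for Python (boto3). The stream name `trucks-gps`, the region, and the payload fields are placeholders for illustration, not part of the course material.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def send_position(truck_id: str, lat: float, lon: float) -> None:
    # Using truck_id as the partition key guarantees that all records for a
    # given truck land in the same shard and are therefore read in order.
    kinesis.put_record(
        StreamName="trucks-gps",  # placeholder stream name
        Data=json.dumps({"truck_id": truck_id, "lat": lat, "lon": lon}).encode(),
        PartitionKey=truck_id,
    )

send_position("truck_17", 48.85, 2.35)
```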
AWS KINESIS STREAMS – LIMITS
ØProducer:
Ø 1MB/s or 1000 messages/s at write PER SHARD
Ø “ProvisionedThroughputExceededException” otherwise
ØConsumer Classic:
Ø 2MB/s at read PER SHARD across all consumers
Ø 5 API calls per second PER SHARD across all consumers
Ø if 3 different applications are consuming, possibility of throttling
ØData Retention:
Ø 24 hours data retention by default
Ø Can be extended to 7 days
AWS KINESIS PRODUCERS
ØKinesis SDK
ØKinesis Producer Library (KPL)
ØKinesis Agent
ØCloudWatch Logs
Ø3rd Party Libraries:
Ø Spark, Log4J Appenders
Ø Flume
Ø Kafka Connect
Ø NiFi, etc…
AWS KINESIS PRODUCER SDK
ØAPIs that are used are PutRecord (one) and PutRecords (many records)
ØPutRecords uses batching and increases throughput => fewer HTTP requests
ØProvisionedThroughputExceededException if we go over the limits
Ø+ AWS Mobile SDK: Android, iOS, etc...
ØUse case: low throughput, higher latency, simple API, AWS Lambda
ØManaged AWS sources for Kinesis Data Streams:
Ø CloudWatch Logs
Ø AWS IoT
Ø Kinesis Data Analytics
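As a rough illustration of the SDK batching described above, the sketch below sends several records in a single PutRecords call with boto3. The stream name and record contents are hypothetical; note that PutRecords is not all-or-nothing, so a caller would still retry any failed records.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

events = [{"user": f"user_{i}", "action": "click"} for i in range(100)]

# One PutRecords call carries up to 500 records, reducing HTTP overhead
# compared with one PutRecord call per record.
response = kinesis.put_records(
    StreamName="clickstream",  # placeholder stream name
    Records=[
        {"Data": json.dumps(e).encode(), "PartitionKey": e["user"]}
        for e in events
    ],
)

# Inspect FailedRecordCount and retry the individual records that were
# throttled or rejected.
print("Failed records:", response["FailedRecordCount"])
```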
AWS KINESIS API – EXCEPTIONS
ØProvisionedThroughputExceededException
ØHappens when sending more data than provisioned (exceeding the MB/s or TPS limit for any shard)
ØMake sure you don’t have a hot shard (e.g. a poorly distributed partition key sending too much data to one shard)
ØSolution:
ØRetries with backoff (see the sketch below)
ØIncrease shards (scaling)
ØEnsure your partition key is a good one
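A minimal sketch of the “retries with backoff” idea using boto3. The stream name is a placeholder and the backoff parameters are arbitrary examples, not a recommendation.

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")

def put_with_backoff(data: dict, key: str, max_retries: int = 5) -> None:
    for attempt in range(max_retries):
        try:
            kinesis.put_record(
                StreamName="my-stream",  # placeholder stream name
                Data=json.dumps(data).encode(),
                PartitionKey=key,
            )
            return
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            # Exponential backoff: wait 0.1 s, 0.2 s, 0.4 s, ... before retrying.
            time.sleep(0.1 * (2 ** attempt))
    raise RuntimeError("Record could not be sent after retries")
```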
KINESIS PRODUCER LIBRARY
ØEasy to use and highly configurable C++ / Java library
ØUsed for building high performance, long-running producers
ØAutomated and configurable retry mechanism
ØSynchronous or Asynchronous API (better performance for async)
ØSubmits metrics to CloudWatch for monitoring
ØBatching (both turned on by default) – increase throughput, decrease cost:
Ø Collect records and write to multiple shards in the same PutRecords API call
Ø Aggregate – increased latency:
• Capability to store multiple records in one record (to go over the 1000 records per second limit)
• Increase payload size and improve throughput (maximize the 1 MB/s limit)
ØCompression must be implemented by the user
ØKPL records must be decoded with the KCL or a special helper library
KINESIS PRODUCER LIBRARY (KPL) BATCHING
ØWe can influence the batching efficiency by introducing some delay with RecordMaxBufferedTime (default 100 ms)
KINESIS PRODUCER LIBRARY – WHEN NOT TO USE
ØThe KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable)
ØLarger values of RecordMaxBufferedTime result in higher packing efficiency and better performance
ØApplications that cannot tolerate this additional delay may need to use the AWS SDK directly
KINESIS AGENT
ØMonitors log files and sends them to Kinesis Data Streams
ØJava-based agent, built on top of KPL
ØInstall in Linux-based server environments
Features:
Ø Write from multiple directories and write to multiple streams
Ø Routing feature based on directory / log file
Ø Pre-process data before sending to streams (single line, CSV to JSON, log to JSON...)
Ø The agent handles file rotation, checkpointing, and retry upon failures
Ø Emits metrics to CloudWatch for monitoring
AWS KINESIS CONSUMERS
ØKinesis SDK
ØKinesis Client Library (KCL)
ØKinesis Connector Library
ØKinesis Firehose
ØAWS Lambda
Ø3rd party libraries: Spark, Log4J Appenders, Flume, Kafka Connect...
ØKinesis Consumer Enhanced Fan-Out
KINESIS CONSUMER SDK - GETRECORDS
ØClassic Kinesis - records are polled by consumers from a shard
ØEach shard has 2 MB total aggregate throughput
ØGetRecords returns up to 10 MB of data (then throttled for 5 seconds) or up to 10,000 records
ØMaximum of 5 GetRecords API calls per shard per second = 200 ms latency
ØIf 5 consumer applications consume from the same shard, each consumer can poll only once per second and receive less than 400 KB/s
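A minimal polling-consumer sketch with boto3 to illustrate the GetRecords flow. The stream name is a placeholder, only the first shard is read, and a production consumer would normally use the KCL rather than iterating shards by hand.

```python
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"  # placeholder stream name

# Read only the first shard for simplicity; the KCL handles all shards for you.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["SequenceNumber"], record["Data"])
    iterator = resp["NextShardIterator"]
    # Stay under the 5 GetRecords calls per shard per second limit.
    time.sleep(0.2)
```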
KINESIS CLIENT LIBRARY (KCL)
Ø Java-first library, but exists for other languages too (Golang, Python, Ruby, Node, .NET ...)
Ø Reads records from Kinesis produced with the KPL (de-aggregation)
Ø Shares multiple shards with multiple consumers in one “group”, shard discovery
Ø Checkpointing feature to resume progress
Ø Leverages DynamoDB for coordination and checkpointing (one row per shard)
Ø Make sure you provision enough WCU / RCU
Ø Or use On-Demand for DynamoDB
Ø Otherwise DynamoDB may slow down the KCL
Ø Record processors will process the data
Ø ExpiredIteratorException => increase WCU
KINESIS CONNECTOR LIBRARY
ØOlder Java library (2016), leverages the KCL library
ØWrite data to:
Ø Amazon S3
Ø DynamoDB
Ø Redshift
Ø ElasticSearch
ØKinesis Firehose replaces the Connector Library for a few of these targets, Lambda for the others
AWS LAMBDA SOURCING FROM KINESIS
ØAWS Lambda can source records from Kinesis Data Streams
ØLambda consumer has a library to de-aggregate records from the KPL
ØLambda can be used to run lightweight ETL to:
Ø Amazon S3
Ø DynamoDB
Ø Redshift
Ø ElasticSearch
Ø Anywhere you want
ØLambda can be used to trigger notifications / send emails in real time
ØLambda has a configurable batch size (more in Lambda section)
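A sketch of a Lambda handler consuming a Kinesis event. The field names follow the standard Kinesis event structure Lambda passes in (base64-encoded data under each record’s kinesis.data); the actual ETL / notification logic is left as a placeholder.

```python
import base64
import json

def lambda_handler(event, context):
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded in the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        partition_key = record["kinesis"]["partitionKey"]
        # Lightweight ETL (write to S3, DynamoDB, Redshift, ES...) or a
        # notification would go here.
        print(partition_key, payload)
```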
KINESIS ENHANCED FAN OUT
ØNew game-changing feature from August 2018
ØWorks with KCL 2.0 and AWS Lambda (Nov 2018)
ØEach consumer gets 2 MB/s of provisioned throughput per shard
ØThat means 20 consumers will get 40 MB/s per shard aggregated
ØNo more 2 MB/s limit!
ØEnhanced Fan-Out: Kinesis pushes data to consumers over HTTP/2
ØReduced latency (~70 ms)
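For illustration, registering an enhanced fan-out consumer looks roughly like the boto3 sketch below; the stream ARN and consumer name are placeholders. KCL 2.x or Lambda would then subscribe to shards and receive the HTTP/2 push on that consumer’s behalf.

```python
import boto3

kinesis = boto3.client("kinesis")

# Each registered consumer gets its own dedicated 2 MB/s per shard,
# pushed over HTTP/2 instead of being polled with GetRecords.
response = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",  # placeholder
    ConsumerName="analytics-app",  # placeholder
)
print(response["Consumer"]["ConsumerARN"])
```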
ENHANCED FAN-OUT VS STANDARD CONSUMERS
ØStandard consumers:
Ø Low number of consuming applications (1,2,3...)
Ø Can tolerate ~200 ms latency
Ø Minimize cost
ØEnhanced Fan Out Consumers:
Ø Multiple Consumer applications for the same Stream
Ø Low Latency requirements ~70ms
Ø Higher costs (see Kinesis pricing page)
Ø Default limit of 5 consumers using enhanced fan-out per data stream
KINESIS OPERATIONS – ADDING SHARDS
ØAlso called “Shard Splitting”
ØCan be used to increase the Stream capacity (1 MB/s data in per shard)
ØCan be used to divide a “hot shard”
ØThe old shard is closed and will be deleted once the data is expired
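A hedged sketch of splitting a hot shard with boto3. The stream name is a placeholder, only the first shard is inspected, and the new starting hash key is simply the midpoint of the parent shard’s hash key range.

```python
import boto3

kinesis = boto3.client("kinesis")

# Describe the stream to find the parent shard's hash key range.
shard = kinesis.describe_stream(StreamName="my-stream")["StreamDescription"]["Shards"][0]
start = int(shard["HashKeyRange"]["StartingHashKey"])
end = int(shard["HashKeyRange"]["EndingHashKey"])

# Split the shard at the midpoint of its hash key range; the parent shard
# is closed and the data in it expires after the retention period.
kinesis.split_shard(
    StreamName="my-stream",  # placeholder stream name
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((start + end) // 2),
)
```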
KINESIS OPERATIONS – MERGING SHARDS
ØDecrease the Stream capacity and save costs
ØCan be used to group two shards with low traffic
ØOld shards are closed and deleted based on data expiration
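Merging is symmetric; a minimal boto3 sketch with placeholder names is shown below. The two shards must be adjacent in the hash key space.

```python
import boto3

kinesis = boto3.client("kinesis")

# Merge two adjacent, low-traffic shards back into one to reduce cost.
kinesis.merge_shards(
    StreamName="my-stream",                      # placeholder stream name
    ShardToMerge="shardId-000000000001",         # placeholder shard IDs
    AdjacentShardToMerge="shardId-000000000002",
)
```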
OUT-OF-ORDER RECORDS AFTER RESHARDING
ØAfter a reshard, you can read from the child shards
ØHowever, data you haven’t read yet could still be in the parent shard
ØIf you start reading the child before completely reading the parent, you could read data for a particular hash key out of order
ØAfter a reshard, read entirely from the parent until you don’t have new records
ØNote: the Kinesis Client Library (KCL) has this logic already built in, even after resharding operations
KINESIS OPERATIONS – AUTO SCALING
ØAuto Scaling is not a native feature of Kinesis
ØThe API call to change the number of shards is UpdateShardCount
ØWe can implement Auto Scaling with AWS Lambda (see the sketch below)
ØSee: https://aws.amazon.com/blogs/big-data/scaling-amazon-kinesis-data-streams-with-aws-application-auto-scaling/
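A minimal sketch of the Lambda-based scaling idea: a function that calls UpdateShardCount to double the shards of a placeholder stream, for example when triggered by a CloudWatch alarm. Error handling and the actual scaling policy are omitted.

```python
import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"  # placeholder stream name

def lambda_handler(event, context):
    current = kinesis.describe_stream_summary(StreamName=STREAM)[
        "StreamDescriptionSummary"
    ]["OpenShardCount"]
    # UpdateShardCount performs the resharding for us (uniform scaling),
    # subject to the limitations listed on the next slide.
    kinesis.update_shard_count(
        StreamName=STREAM,
        TargetShardCount=current * 2,
        ScalingType="UNIFORM_SCALING",
    )
```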
KINESIS SCALING LIMITATIONS
ØResharding cannot be done in parallel. Plan capacity in advance
ØYou can only perform one resharding operation at a time and it takes a few seconds
ØFor 1000 shards, it takes 30K seconds (8.3 hours) to double the shards to 2000
You can’t do the following:
Ø Scale more than twice for each rolling 24-hour period for each stream
Ø Scale up to more than double your current shard count for a stream
Ø Scale down below half your current shard count for a stream
Ø Scale up to more than 10000 shards in a stream
Ø Scale a stream with more than 10000 shards down unless the result is fewer than 10000
shards
Ø Scale up to more than the shard limit for your account
ØIf you need to scale more than once a day, you can ask AWS to increase this limit
KINESIS SECURITY
ØControl access / authorization using IAM policies
ØEncryption in flight using HTTPS endpoints
ØEncryption at rest using KMS
ØClient side encryption must be manually implemented (harder)
ØVPC Endpoints available for Kinesis to access within VPC
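As an illustration of encryption at rest, server-side encryption can be turned on for an existing stream with a single call; the stream name and KMS key alias below are placeholders.

```python
import boto3

kinesis = boto3.client("kinesis")

# Enable server-side encryption with a KMS key; records are encrypted at
# rest from this point on (encryption in flight is simply HTTPS).
kinesis.start_stream_encryption(
    StreamName="my-stream",     # placeholder stream name
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",  # placeholder: the AWS-managed key for Kinesis
)
```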
KINESIS DATA STREAMS – HANDLING DUPLICATES FOR PRODUCERS
ØProducer retries can create duplicates due to network timeouts
ØAlthough the two records have identical data, they also have unique sequence numbers
Ø Fix: embed a unique record ID in the data to de-duplicate on the consumer side (see the sketch below)
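A minimal sketch of embedding a unique record ID on the producer side so the consumer can de-duplicate later; the stream name is a placeholder.

```python
import json
import uuid
import boto3

kinesis = boto3.client("kinesis")

def put_event(payload: dict, partition_key: str) -> None:
    # Embed an application-level unique ID: even if a network timeout causes
    # the same record to be written twice (with two different sequence
    # numbers), the consumer can de-duplicate on "record_id".
    payload = {"record_id": str(uuid.uuid4()), **payload}
    kinesis.put_record(
        StreamName="my-stream",  # placeholder stream name
        Data=json.dumps(payload).encode(),
        PartitionKey=partition_key,
    )
```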
KINESIS DATA STREAMS – HANDLING DUPLICATES FOR CONSUMERS
ØConsumer retries can make your application read the same data twice
ØConsumer retries happen when record processors restart:
Ø A worker terminates unexpectedly
Ø Worker instances are added or removed
Ø Shards are merged or split
Ø The application is deployed
ØFixes:
Ø Make your consumer application idempotent (a DynamoDB-based sketch follows below)
Ø If the final destination can handle duplicates, it’s recommended to de-duplicate there
ØMore info: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html
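One way to make a consumer idempotent (an illustrative sketch, not the only approach) is a conditional write to DynamoDB keyed on the record ID embedded by the producer. The table name is a placeholder and is assumed to use record_id as its partition key.

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("processed-records")  # placeholder table

def process_once(record_id: str, payload: dict) -> None:
    try:
        # The conditional write succeeds only the first time a record_id is
        # seen, so replays of the same record become harmless no-ops.
        table.put_item(
            Item={"record_id": record_id, "payload": payload},
            ConditionExpression="attribute_not_exists(record_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] != "ConditionalCheckFailedException":
            raise  # a real error; duplicates are silently ignored
```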
AWS KINESIS DATA FIREHOSE
ØFully Managed Service, no administration
ØNear Real Time (60 seconds minimum latency for non-full batches)
ØLoad data into Redshift / Amazon S3 / ElasticSearch / Splunk
ØAutomatic scaling
ØSupports many data formats
ØData Conversions from JSON to Parquet / ORC (only for S3)
ØData Transformation through AWS Lambda (ex: CSV => JSON)
ØSupports compression when target is Amazon S3 (GZIP, ZIP, and SNAPPY)
ØOnly GZIP if the data is further loaded into Redshift
ØSpark / KCL do not read from KDF
ØPay for the amount of data going through Firehose
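For comparison with the Data Streams producer, writing to Firehose from code is a single call against the delivery stream; the delivery stream name and payload are placeholders. Firehose then buffers the record and delivers it to the configured destination on its own.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Firehose buffers this record and flushes it to the configured destination
# based on the buffer size / buffer interval rules described below.
firehose.put_record(
    DeliveryStreamName="orders-delivery-stream",  # placeholder delivery stream
    Record={"Data": (json.dumps({"order_id": 42, "amount": 9.99}) + "\n").encode()},
)
```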
AWS KINESIS DATA FIREHOSE DIAGRAM
KINESIS DATA FIREHOSE DELIVERY DIAGRAM
FIREHOSE BUFFER SIZING
ØFirehose accumulates records in a buffer
ØThe buffer is flushed based on time and size rules
ØBuffer Size (ex: 32MB): if that buffer size is reached, it’s flushed
ØBuffer Time (ex: 2 minutes): if that time is reached, it’s flushed
ØFirehose can automatically increase the buffer size to increase throughput
ØHigh throughput => Buffer Size will be hit
ØLow throughput => Buffer Time will be hit
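A hedged sketch of how the buffer rules are expressed when creating an S3 delivery stream with boto3. The ARNs and names are placeholders, and only the buffering- and compression-related fields are shown.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="orders-delivery-stream",  # placeholder
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-firehose-bucket",             # placeholder
        # Flush whichever comes first: 32 MB of data or 120 seconds.
        "BufferingHints": {"SizeInMBs": 32, "IntervalInSeconds": 120},
        "CompressionFormat": "GZIP",
    },
)
```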
AWS KINESIS DATA STREAMS VS FIREHOSE
Streams:
Ø Going to write custom code (producer / consumer)
Ø Real time (~200 ms latency for classic)
Ø Must manage scaling (shard splitting / merging)
Ø Data storage for 1 to 7 days, replay capability, multiple consumers
Ø Use with Lambda to insert data in real time into ElasticSearch (for example)
Firehose:
Ø Fully managed, send to S3, Splunk, Redshift, ElasticSearch
Ø Serverless data transformations with Lambda
Ø Near real time (lowest buffer time is 1 minute)
Ø Automated scaling
Ø No data storage
CLOUDWATCH LOGS SUBSCRIPTION FILTERS
ØYou can stream CloudWatch Logs into
Ø Kinesis Data Streams
Ø Kinesis Data Firehose
Ø AWS Lambda
Ø Using CloudWatch Logs Subscription Filters
ØYou can enable them using the AWS CLI
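Illustrative sketch of enabling a subscription filter programmatically (boto3 rather than the CLI). The log group, stream ARN, and role ARN are placeholders, and the role must allow CloudWatch Logs to write to the stream.

```python
import boto3

logs = boto3.client("logs")

logs.put_subscription_filter(
    logGroupName="/aws/my-app",  # placeholder log group
    filterName="to-kinesis",
    filterPattern="",  # empty pattern = forward every log event
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/logs",  # placeholder
    roleArn="arn:aws:iam::123456789012:role/cwlogs-to-kinesis",           # placeholder
)
```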
CLOUDWATCH LOGS SUBSCRIPTION FILTERS PATTERNS – NEAR REAL TIME INTO AMAZON ES
CLOUDWATCH LOGS SUBSCRIPTION FILTERS PATTERNS – REAL TIME INTO AMAZON ES
CLOUDWATCH LOGS SUBSCRIPTION FILTERS PATTERNS – REAL TIME ANALYTICS
USE CASE – EXERCISE 1 – DATA COLLECTION
• Create a Kinesis Firehose delivery stream
• Generate an OrderHistory CSV file using a LogGenerator Python script
• Publish the data to an S3 bucket from Firehose using the Kinesis Agent
• Create a Kinesis Data Stream
• Publish data from the Kinesis Agent to the Kinesis Data Stream