BIG DATA SPECIALTY
CERTIFICATION
AWS CERTIFIED DATA ANALYTICS SPECIALTY COURSE
DATA COLLECTION
DATA COLLECTION INTRODUCTION
ØReal Time - Immediate actions
Ø Kinesis Data Streams (KDS)
Ø Simple Queue Service (SQS)
Ø Internet of Things (IoT)
ØNear-real time - Reactive actions
Ø Kinesis Data Firehose (KDF)
Ø Database Migration Service (DMS)
ØBatch - Historical Analysis
Ø Snowball
ØData Pipeline
AWS KINESIS
AWS KINESIS OVERVIEW
ØKinesis is a managed alternative to Apache Kafka
ØGreat for application logs, metrics, IoT, clickstreams
ØGreat for “real-time” big data
ØGreat for stream processing frameworks (Spark, NiFi, etc...)
ØData is synchronously replicated across 3 AZs
ØKinesis Components
Ø Kinesis Streams: low latency streaming ingest at scale
Ø Kinesis Analytics: perform real-time analytics on streams using SQL
Ø Kinesis Firehose: load streams into S3, Redshift, ElasticSearch …
AWS KINESIS EXAMPLE
AWS KINESIS OVERVIEW
ØStreams are divided into ordered Shards / Partitions
ØData retention is 1 day by default, can go up to 7 days
ØAbility to reprocess / replay data
ØMultiple applications can consume the same stream
ØReal-time processing with scale of throughput
ØOnce data is inserted in Kinesis, it can’t be deleted (immutability)
AWS KINESIS STREAMS SHARDS
ØOne stream is made of many different shards
ØBilling is per shard provisioned, can have as many shards as you want
ØBatching available, or per-message calls
ØThe number of shards can evolve over time (reshard / merge)
ØRecords are ordered per shard
AWS KINESIS STREAMS - SHARDS
ØAWS Kinesis Streams Records:
Ø Data Blob: the data being sent, serialized as bytes. Up to 1 MB. Can represent anything
Ø Record Key: sent alongside a record, helps to group records in shards. Same key = same shard. Use a highly distributed key to avoid the “hot partition” problem
Ø Sequence Number: unique identifier for each record put in a shard. Added by Kinesis after ingestion
DATA ORDERING FOR KINESIS
ØImagine you have 100 trucks (truck_1, truck_2, ... truck_100) on the road sending their GPS positions regularly into AWS.
ØYou want to consume the data in order for each truck, so that you can track their movement accurately.
ØHow should you send that data into Kinesis?
ØAnswer: send using a “Partition Key” value of the “truck_id”
ØThe same key will always go to the same shard
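As a sketch of this pattern, the boto3 call below (Python is an assumption, since the deck shows no SDK code; stream name and coordinates are placeholders) sends each GPS position with PartitionKey set to the truck_id, so all records for one truck land on the same shard and keep their order.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

def send_position(truck_id: str, lat: float, lon: float) -> None:
    # Same partition key => same shard => per-truck ordering is preserved
    kinesis.put_record(
        StreamName="truck-positions",  # hypothetical stream name
        Data=json.dumps({"truck_id": truck_id, "lat": lat, "lon": lon}).encode(),
        PartitionKey=truck_id,
    )

send_position("truck_17", 48.8566, 2.3522)
```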
AWS KINESIS STREAMS RECORDS
ØProducer:
Ø 1MB/s or 1000 messages/s at write PER SHARD
Ø “ProvisionedThroughputExceededException” otherwise
ØConsumer Classic:
Ø 2MB/s at read PER SHARD across all consumers
Ø 5 API calls per second PER SHARD across all consumers
Ø if 3 different applications are consuming, possibility of throttling
ØData Retention:
Ø 24 hours data retention by default
Ø Can be extended to 7 days
AWS KINESIS PRODUCERS
ØKinesis SDK
ØKinesis Producer Library (KPL)
ØKinesis Agent
ØCloudWatch Logs
Ø3rd Party Libraries:
Ø Spark, Log4J Appenders
Ø Flume
Ø Kafka Connect
Ø NiFi, etc…
AWS KINESIS PRODUCER SDK
ØAPIs that are used are PutRecord (one) and PutRecords (many records)
ØPutRecords uses batching and increases throughput => fewer HTTP requests
ØProvisionedThroughputExceededException if we go over the limits
Ø+ AWS Mobile SDK: Android, iOS, etc...
ØUse case: low throughput, higher latency, simple API, AWS Lambda
ØManaged AWS sources for Kinesis Data Streams:
Ø CloudWatch Logs
Ø AWS IoT
Ø Kinesis Data Analytics
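A minimal sketch of PutRecords batching with the Python SDK (the stream name and event shape are assumptions): one HTTP call pushes the whole batch, and each entry carries its own partition key.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

events = [{"user": f"user_{i}", "action": "click"} for i in range(100)]

# One PutRecords call instead of 100 PutRecord calls => fewer HTTP requests
response = kinesis.put_records(
    StreamName="my-stream",  # hypothetical stream name
    Records=[
        {"Data": json.dumps(e).encode(), "PartitionKey": e["user"]}
        for e in events
    ],
)

# PutRecords is not atomic: individual records can fail while others succeed
print("Failed records:", response["FailedRecordCount"])
```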
AWS KINESIS API – EXCEPTIONS
ØProvisionedThroughputExceededException
ØHappens when sending more data than provisioned (exceeding the MB/s or records/s limit for any shard)
ØMake sure you don’t have a hot shard (e.g., your partition key is poorly distributed and too much data goes to one shard)
ØSolutions:
ØRetries with backoff (see the sketch below)
ØIncrease the number of shards (scaling)
ØEnsure your partition key is well distributed
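A sketch of the retry-with-backoff idea using boto3; the backoff factor and jitter strategy shown here are one reasonable choice, not a prescribed one.

```python
import random
import time

import boto3

kinesis = boto3.client("kinesis")

def put_with_backoff(stream, data, key, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(StreamName=stream, Data=data, PartitionKey=key)
        except kinesis.exceptions.ProvisionedThroughputExceededException:
            # Exponential backoff with jitter before retrying the hot shard
            time.sleep((2 ** attempt) * 0.1 + random.random() * 0.1)
    raise RuntimeError("Record could not be written after retries")
```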
KINESIS PRODUCER LIBRARY
ØEasy to use and highly configurable C++ / Java library
ØUsed for building high performance, long-running producers
ØAutomated and configurable retry mechanism
ØSynchronous or Asynchronous API (better performance for async)
ØSubmits metrics to CloudWatch for monitoring
ØBatching (both turned on by default) – increases throughput, decreases cost:
Ø Collection: collect records and write to multiple shards in the same PutRecords API call
Ø Aggregation – increased latency:
• Capability to store multiple records in one record (go over the 1000 records per second limit)
• Increase payload size and improve throughput (maximize the 1 MB/s limit)
ØCompression must be implemented by the user
ØKPL records must be decoded with the KCL or a special helper library
KINESIS PRODUCER LIBRARY (KPL) BATCHING
ØWe can influence the batching efficiency by introducing some delay with RecordMaxBufferedTime (default 100 ms)
KINESIS PRODUCER LIBRARY – WHEN NOT TO USE
ØThe KPL can incur an additional processing delay of up to RecordMaxBufferedTime within the library (user-configurable)
ØLarger values of RecordMaxBufferedTime result in higher packing efficiency and better performance
ØApplications that cannot tolerate this additional delay may need to use the AWS SDK directly
KINESIS AGENT
ØMonitors log files and sends them to Kinesis Data Streams
ØJava-based agent, built on top of KPL
ØInstall in Linux-based server environments
Features:
Ø Write from multiple directories and write to multiple streams
Ø Routing feature based on directory / log file
Ø Pre-process data before sending to streams (single line, csv to json, log to json...)
Ø The agent handles file rotation, checkpointing, and retry upon failures
Ø Emits metrics to CloudWatch for monitoring
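As an illustration of these features (file path, stream name, and field names are assumptions), a Kinesis Agent configuration in /etc/aws-kinesis/agent.json might look roughly like this:

```json
{
  "cloudwatch.emitMetrics": true,
  "flows": [
    {
      "filePattern": "/var/log/app/orders*.log",
      "kinesisStream": "order-stream",
      "dataProcessingOptions": [
        {
          "optionName": "CSVTOJSON",
          "customFieldNames": ["order_id", "customer", "amount"]
        }
      ]
    }
  ]
}
```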
AWS KINESIS CONSUMERS
ØKinesis SDK
ØKinesis Client Library (KCL)
ØKinesis Connector Library
ØKinesis Firehose
ØAWS Lambda
Ø3rd party libraries: Spark, Log4J Appenders, Flume, Kafka Connect...
ØKinesis Consumer Enhanced Fan-Out (covered later)
KINESIS CONSUMER SDK - GETRECORDS
ØClassic Kinesis - records are polled by consumers from a shard
ØEach shard has 2 MB total aggregate read throughput
ØGetRecords returns up to 10 MB of data (then throttles for 5 seconds) or up to 10,000 records
ØMaximum of 5 GetRecords API calls per shard per second = 200 ms latency
ØIf 5 consumer applications consume from the same shard, each consumer can poll once per second and receive less than 400 KB/s
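A bare-bones polling loop with the Python SDK, to make the GetRecords mechanics concrete (stream name, shard ID, and the TRIM_HORIZON starting position are assumptions):

```python
import time

import boto3

kinesis = boto3.client("kinesis")

shard_iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",            # hypothetical stream name
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
)["ShardIterator"]

while shard_iterator:
    response = kinesis.get_records(ShardIterator=shard_iterator, Limit=1000)
    for record in response["Records"]:
        print(record["SequenceNumber"], record["Data"])
    shard_iterator = response.get("NextShardIterator")
    time.sleep(0.2)  # stay under the 5 GetRecords calls per shard per second limit
```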
KINESIS CLIENT LIBRARY (KCL)
Ø Java-first library, but exists for other languages too (Golang, Python, Ruby, Node, .NET...)
Ø Reads records from Kinesis produced with the KPL (de-aggregation)
Ø Shares multiple shards across multiple consumers in one “group”, shard discovery
Ø Checkpointing feature to resume progress
Ø Leverages DynamoDB for coordination and checkpointing (one row per shard)
Ø Make sure you provision enough WCU / RCU
Ø Or use On-Demand for DynamoDB
Ø Otherwise DynamoDB may slow down the KCL
Ø Record processors will process the data
Ø ExpiredIteratorException => increase WCU
KINESIS CONNECTOR LIBRARY
ØOlder Java library (2016), leverages the KCL library
ØWrite data to:
Ø Amazon S3
Ø DynamoDB
Ø Redshift
Ø ElasticSearch
ØKinesis Firehose replaces the Connector Library for a few of these targets, Lambda for the others
AWS LAMBDA SOURCING FROM KINESIS
ØAWS Lambda can source records from Kinesis Data Streams
ØThe Lambda consumer has a library to de-aggregate records from the KPL
ØLambda can be used to run lightweight ETL to:
Ø Amazon S3
Ø DynamoDB
Ø Redshift
Ø ElasticSearch
Ø Anywhere you want
ØLambda can be used to trigger notifications / send emails in real time
ØLambda has a configurable batch size (more in Lambda section)
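For illustration, a minimal Lambda handler for a Kinesis event source; the JSON payload shape inside Data is an assumption, and KPL-aggregated records would additionally need de-aggregation.

```python
import base64
import json

def lambda_handler(event, context):
    # Each invocation receives a batch of records from the stream
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)  # assumes producers send JSON
        print(record["kinesis"]["partitionKey"], item)
    return {"processed": len(event["Records"])}
```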
KINESIS ENHANCED FAN OUT
ØNew game-changing feature from August 2018
ØWorks with KCL 2.0 and AWS Lambda (Nov 2018)
ØEach consumer gets 2 MB/s of provisioned throughput per shard
ØThat means 20 consumers will get 40 MB/s per shard, aggregated
ØNo more shared 2 MB/s limit!
ØEnhanced Fan-Out: Kinesis pushes data to consumers over HTTP/2
ØReduced latency (~70 ms)
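A sketch of registering an enhanced fan-out consumer and subscribing to a shard with boto3; the ARNs, names, and shard ID are placeholders, and in practice the KCL 2.x or Lambda manages this subscription for you.

```python
import boto3

kinesis = boto3.client("kinesis")

stream_arn = "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream"  # placeholder

consumer = kinesis.register_stream_consumer(
    StreamARN=stream_arn, ConsumerName="my-efo-consumer"
)["Consumer"]

# Kinesis pushes records over HTTP/2; boto3 exposes them as an event stream
subscription = kinesis.subscribe_to_shard(
    ConsumerARN=consumer["ConsumerARN"],
    ShardId="shardId-000000000000",
    StartingPosition={"Type": "LATEST"},
)
for event in subscription["EventStream"]:
    for record in event["SubscribeToShardEvent"]["Records"]:
        print(record["PartitionKey"], record["Data"])
```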
ENHANCED FAN-OUT VS STANDARD CONSUMERS
ØStandard consumers:
Ø Low number of consuming applications (1,2,3...)
Ø Can tolerate ~200 ms latency
Ø Minimize cost
ØEnhanced Fan Out Consumers:
Ø Multiple Consumer applications for the same Stream
Ø Low Latency requirements ~70ms
Ø Higher costs (see Kinesis pricing page)
Ø Default limit of 5 consumers using enhanced fan-out per data stream
KINESIS OPERATIONS – ADDING SHARDS
ØAlso called “Shard Splitting”
ØCan be used to increase the Stream capacity (1 MB/s data in per shard)
ØCan be used to divide a “hot shard”
ØThe old shard is closed and will be deleted once the data is expired
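A hypothetical SplitShard call via boto3 (stream name and shard ID are placeholders); the new starting hash key shown is the midpoint of the full hash key range, which splits an evenly loaded shard in half.

```python
import boto3

kinesis = boto3.client("kinesis")

# Split a hot shard at a chosen point in its hash key range
kinesis.split_shard(
    StreamName="my-stream",           # hypothetical stream name
    ShardToSplit="shardId-000000000000",
    NewStartingHashKey="170141183460469231731687303715884105728",  # midpoint of the full range
)
```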
KINESIS OPERATIONS – MERGING SHARDS
ØDecrease the Stream capacity and save costs
ØCan be used to group two shards with low traffic
ØOld shards are closed and deleted based on data expiration
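And the corresponding MergeShards call (shard IDs are placeholders); the two shards must be adjacent in the hash key space.

```python
import boto3

kinesis = boto3.client("kinesis")

# Merge two adjacent, low-traffic shards into one
kinesis.merge_shards(
    StreamName="my-stream",           # hypothetical stream name
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)
```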
OUT-OF-ORDER RECORDS AFTER RESHARDING
ØAfter a reshard, you can read from the child shards
ØHowever, data you haven’t read yet could still be in the parent
ØIf you start reading the child before completely reading the parent, you could read data for a particular hash key out of order
ØAfter a reshard, read entirely from the parent until you don’t have new records
ØNote: the Kinesis Client Library (KCL) has this logic already built in, even after resharding operations
KINESIS OPERATIONS – AUTO SCALING
ØAuto Scaling is not a native feature of Kinesis
ØThe API call to change the number of shards is UpdateShardCount
ØWe can implement Auto Scaling with AWS Lambda
ØSee: https://aws.amazon.com/blogs/big-data/scaling-amazon-kinesis-data-streams-with-aws-application-auto-scaling/
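For reference, a sketch of the UpdateShardCount call with boto3 (stream name and target count are placeholders); UNIFORM_SCALING is the scaling type, and the call is subject to the limitations listed on the next slide.

```python
import boto3

kinesis = boto3.client("kinesis")

kinesis.update_shard_count(
    StreamName="my-stream",        # hypothetical stream name
    TargetShardCount=4,            # subject to the doubling/halving limits below
    ScalingType="UNIFORM_SCALING",
)
```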
KINESIS SCALING LIMITATIONS
ØResharding cannot be done in parallel. Plan capacity in advance
ØYou can only perform one resharding operation at a time and it takes a few seconds
ØFor 1000 shards, it takes 30K seconds (8.3 hours) to double the shards to 2000
You can’t do the following:
Ø Scale more than twice for each rolling 24-hour period for each stream
Ø Scale up to more than double your current shard count for a stream
Ø Scale down below half your current shard count for a stream
Ø Scale up to more than 10000 shards in a stream
Ø Scale a stream with more than 10000 shards down unless the result is fewer than 10000 shards
Ø Scale up to more than the shard limit for your account
ØIf you need to scale more than twice a day, you can ask AWS to raise this limit
KINESIS SECURITY
ØControl access / authorization using IAM policies
ØEncryption in flight using HTTPS endpoints
ØEncryption at rest using KMS
ØClient side encryption must be manually implemented (harder)
ØVPC Endpoints available for Kinesis to access within VPC
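For example, server-side encryption at rest with KMS can be turned on for an existing stream; the stream name is a placeholder, and a customer-managed key ARN can be used instead of the AWS-managed alias.

```python
import boto3

kinesis = boto3.client("kinesis")

# Enable at-rest encryption on the stream using a KMS key
kinesis.start_stream_encryption(
    StreamName="my-stream",        # hypothetical stream name
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",     # AWS-managed key; a customer key ARN also works
)
```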
KINESIS DATA STREAMS – HANDLING DUPLICATES FOR PRODUCERS
ØProducer retries can create duplicates due to network timeouts
ØAlthough the two records have identical data, they have unique sequence numbers
Ø Fix: embed a unique record ID in the data to de-duplicate on the consumer side
KINESIS DATA STREAMS – HANDLING DUPLICATES FOR CONSUMERS
ØConsumer retries can make your application read the same data twice
ØConsumer retries happen when record processors restart:
Ø A worker terminates unexpectedly
Ø Worker instances are added or removed
Ø Shards are merged or split
Ø The application is deployed
ØFixes:
Ø Make your consumer application idempotent
Ø If the final destination can handle duplicates, it’s recommended to do it there
ØMore info: https://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-duplicates.html
AWS KINESIS DATA FIREHOSE
ØFully Managed Service, no administration
ØNear real time (60-second minimum latency for non-full batches)
ØLoad data into Redshift / Amazon S3 / ElasticSearch / Splunk
ØAutomatic scaling
ØSupports many data formats
ØData Conversions from JSON to Parquet / ORC (only for S3)
ØData Transformation through AWS Lambda (ex: CSV => JSON)
ØSupports compression when target is Amazon S3 (GZIP, ZIP, and SNAPPY)
ØOnly GZIP if the data is further loaded into Redshift
ØSpark / KCL do not read from KDF
ØPay for the amount of data going through Firehose
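Producers (or your own code through the SDK) write to a delivery stream with PutRecord / PutRecordBatch; a minimal sketch, assuming a delivery stream named order-firehose and JSON payloads delimited by newlines.

```python
import json
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="order-firehose",  # hypothetical delivery stream name
    Record={"Data": (json.dumps({"order_id": 42, "amount": 9.99}) + "\n").encode()},
)
```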
AWS KINESIS DATA FIREHOSE DIAGRAM
KINESIS DATA FIREHOSE DELIVERY DIAGRAM
FIREHOSE BUFFER SIZING
ØFirehose accumulates records in a buffer
ØThe buffer is flushed based on time and size rules
ØBuffer Size (ex: 32MB): if that buffer size is reached, it’s flushed
ØBuffer Time (ex: 2 minutes): if that time is reached, it’s flushed
ØFirehose can automatically increase the buffer size to increase throughput
ØHigh throughput => Buffer Size will be hit
ØLow throughput => Buffer Time will be hit
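Buffer size and buffer interval are set per destination when the delivery stream is created; a sketch with the Python SDK, where the ARNs are placeholders and only a subset of the S3 destination settings is shown.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="order-firehose",  # hypothetical name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-landing-bucket",              # placeholder
        "BufferingHints": {"SizeInMBs": 32, "IntervalInSeconds": 120},
        "CompressionFormat": "GZIP",
    },
)
```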
AWS KINESIS DATA STREAMS VS FIREHOSE
Streams
Ø Going to write custom code (producer / consumer)
Ø Real time (~200 ms latency for classic)
Ø Must manage scaling (shard splitting / merging)
Ø Data storage for 1 to 7 days, replay capability, multiple consumers
Ø Use with Lambda to insert data in real time into ElasticSearch (for example)
Firehose
Ø Fully managed, send to S3, Splunk, Redshift, ElasticSearch
Ø Serverless data transformations with Lambda
Ø Near real time (lowest buffer time is 1 minute)
Ø Automated scaling
Ø No data storage
CLOUDWATCH LOGS SUBSCRIPTION FILTERS
ØYou can stream CloudWatch Logs into
Ø Kinesis Data Streams
Ø Kinesis Data Firehose
Ø AWS Lambda
Ø Using CloudWatch Logs Subscription Filters
ØYou can enable them using the AWS CLI
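The slide mentions the AWS CLI; the equivalent SDK call is sketched below, where the log group name, destination ARN, and role ARN are placeholders.

```python
import boto3

logs = boto3.client("logs")

# Stream everything from a log group into a Kinesis Data Stream
logs.put_subscription_filter(
    logGroupName="/aws/app/orders",  # placeholder
    filterName="to-kinesis",
    filterPattern="",                # empty pattern matches all log events
    destinationArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",  # placeholder
    roleArn="arn:aws:iam::123456789012:role/cwlogs-to-kinesis",                # placeholder
)
```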
CLOUDWATCH LOGS SUBSCRIPTION FILTERS PATTERNS – NEAR REAL TIME INTO AMAZON ES
CLOUDWATCH LOGS SUBSCRIPTION FILTERS PATTERNS – REAL TIME INTO AMAZON ES
CLOUDWATCH LOGS SUBSCRIPTION FILTERS PATTERNS – REAL TIME ANALYTICS
USE CASE – EXERCISE 1 – DATA COLLECTION
• Create a Kinesis Firehose delivery stream
• Generate an OrderHistory CSV file using a LogGenerator Python script
• Publish the data to an S3 bucket from Firehose using the Kinesis Agent
• Create a Kinesis Data Stream
• Publish data from the Kinesis Agent to the Kinesis Data Stream