Serverless Data
Streaming at Scale
Anahit Pogosova
Lead Cloud Software Engineer (Solita / Yle)
20.10.2020
o Who, What & Why?
o Under the Hood
o Gotchas and Lessons Learned
• Lead Cloud Software Engineer
• Part of the Data & AI team at
Finland’s national public
broadcasting company
Me
• AWS Community Builder
10+ years
“full stack”, all kinds of stuff
• Yle Areena, the biggest
streaming service in Finland
• Areena recommendations
• Areena image personalisation
• Automatic image extraction
• Article recommendations (yle.fi)
• Smart notifications (Yle Uutisvahti)
• .. and more
Yle
• Data!
• user interaction and content metadata
• Collecting the data
• Storing the data
• Visualizing the data
• Utilizing the data (ML & AI)
• To understand the customers
• To help provide better service for everyone
{
"adobe":true,
"is_heartbeat":true,
"collectorreceived":1555267233549,
...
"s:asset:name":"Yle TV1",
"s:event:type":"start",
"s:meta:category":"nettitv",
"s:meta:content_type":"livetv",
"s:meta:ns_st_st":"yle tv1",
...
"s:meta:title":"eduskuntavaalit 2019 - tulosilta",
...
"s:meta:yle.vrsContent":"video",
"s:meta:yle.vrsDevice":"android",
"s:meta:yle.vrsPlatform":"mobile",
"s:meta:yle.vrsProduct":"areena",
"s:meta:yle_client":"android.areena.481-b4ce224bf",
"s:meta:yle_language":"fi",
"s:sp:channel":"yleisradio",
"s:sp:hb_version":"android-2.2.1.214-d5c678",
"s:user:mid":"71057009616815049761612335654599557361"
}
Yle
• ~ 500 000 000 requests per day
• ~ 600 000 requests per minute (rpm) during prime time
• > 0.5 TB event data per day
• Apache Parquet
• JSON
• Max so far: ~ 2.5 million rpm
• elections + hockey finals
Yle
Under the Hood
Agenda
Load the data to the datalake in a
columnar format
Enable content personalization
through near real-time analytics
No server is easier to manage than
“no server”.
Dr. Werner Vogels
(CTO, Amazon)
Kinesis Data Streams
• Fully managed and massively scalable service to stream data
• Data available in milliseconds and stored for 24 hours by default,
up to 7 days
• Custom stream processing with consumers
• Shard is the unit of parallelism
• In: 1 MB/sec or 1 000 records/sec per shard
• Out: 2 MB/sec
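A rough sizing sketch based on the per-shard write limits above. The average record size of about 1 000 bytes is an assumption, derived from the deck's own figures (~0.5 TB and ~500 000 000 requests per day); `min_shards` is a hypothetical helper, not an AWS API.

```python
import math

# Per-shard write limits as quoted on the slide.
MAX_BYTES_PER_SEC = 1_000_000    # 1 MB/sec, treated as 10^6 bytes (assumption)
MAX_RECORDS_PER_SEC = 1_000      # 1 000 records/sec


def min_shards(records_per_sec: float, avg_record_bytes: float) -> int:
    """Smallest shard count that satisfies both write limits on average."""
    by_count = math.ceil(records_per_sec / MAX_RECORDS_PER_SEC)
    by_size = math.ceil(records_per_sec * avg_record_bytes / MAX_BYTES_PER_SEC)
    return max(1, by_count, by_size)


# Prime time: ~600 000 rpm; peak so far: ~2.5 million rpm.
prime_time_shards = min_shards(600_000 / 60, avg_record_bytes=1_000)
peak_shards = min_shards(2_500_000 / 60, avg_record_bytes=1_000)
```

This is a lower bound for evenly distributed traffic only. Uneven partition keys or bursts within a second need extra headroom.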
Amazon Kinesis
Agent
• Stand-alone Java
application to
stream data from
files
Service
Integrations
• CloudWatch Logs
• CloudWatch Events
• AWS IoT
• DB Migration Service
• API Gateway
Amazon Kinesis
Producer Library
(KPL)
• Provides higher
level of abstraction
over API calls
Amazon Kinesis
API (AWS SDK)
• Most flexible
• Allows full control
over writing data
Kinesis Data Streams, Writing Data
• putRecord(params, callback)
• putRecords(params, callback)
• Up to 500 records per request
• Up to 5 MiB per request (including partition keys)
Kinesis Data Streams, AWS SDK
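A minimal sketch of splitting a record stream into batches that respect the two request limits above. The `(partition key, data)` tuple shape and the helper names are illustrative assumptions. A single oversized record is not rejected here, only placed in its own batch.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_RECORDS_PER_REQUEST = 500
MAX_BYTES_PER_REQUEST = 5 * 1024 * 1024  # 5 MiB, partition keys included

Record = Tuple[str, bytes]  # (partition key, data), a hypothetical shape


def record_size(record: Record) -> int:
    """Bytes a record contributes to the request: key plus data."""
    key, data = record
    return len(key.encode("utf-8")) + len(data)


def chunk_records(records: Iterable[Record]) -> Iterator[List[Record]]:
    """Yield batches that stay within both request limits."""
    batch: List[Record] = []
    batch_bytes = 0
    for record in records:
        size = record_size(record)
        full = len(batch) == MAX_RECORDS_PER_REQUEST
        too_big = batch_bytes + size > MAX_BYTES_PER_REQUEST
        if batch and (full or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch could then be sent with one putRecords call.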
putRecords(params, callback)
• Request failure
• Retries by default up to 3 times
• Uses exponential backoff
• Base delay by default is 100 ms
Kinesis Data Streams, AWS SDK
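To make the retry defaults above concrete, the sketch below computes a delay schedule for 3 retries with a 100 ms base. The doubling formula and the optional full jitter are assumptions about how such a backoff is commonly computed, not a statement of the SDK's exact internals.

```python
import random
from typing import Callable, List


def backoff_delays(max_retries: int = 3,
                   base_ms: float = 100.0,
                   jitter: bool = True,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Delay before each retry, in milliseconds.

    The ceiling doubles on every attempt; with jitter enabled the
    actual delay is a random fraction of that ceiling.
    """
    delays = []
    for attempt in range(max_retries):
        ceiling = base_ms * (2 ** attempt)
        delays.append(rng() * ceiling if jitter else ceiling)
    return delays


# Without jitter the ceilings are 100, 200 and 400 ms.
ceilings = backoff_delays(jitter=False)
```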
Lambda
o One Lambda is invoked per shard at a time by default
• NEW(ish)! Parallelization factor (max 10)
• Up to 10 times as many concurrent Lambdas as there are shards!
o Lambda is invoked once per second,
or:
• the number of records reaches the configured batch size
(max 10 000 records)
• the batch payload reaches Lambda’s synchronous invocation payload limit
(6 MB)
• NEW(ish)! the batch window reaches its maximum value
(max 5 min)
Lambda
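A minimal handler sketch for such a batch. Kinesis record data arrives base64-encoded under `Records[].kinesis.data`. The JSON payload and the `s:event:type` field follow the sample event shown earlier in the deck; counting "start" events is only an illustration.

```python
import base64
import json


def handler(event, context=None):
    """Decode one batch of Kinesis records and count 'start' events."""
    decoded = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))
    starts = sum(1 for item in decoded if item.get("s:event:type") == "start")
    return {"records": len(decoded), "starts": starts}
```

If the handler raises, the whole batch counts as failed, which is where the error handling settings become relevant.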
Before..
• Lambda retries the batch until
success or data expiration
• No other batches are
processed from the
shard (aka poison pill)!
Lambda, Error Handling
After!
• Maximum retry attempts
(max 10 000)
• Maximum record age
(1 min – 7 days)
• Bisect batch on function failure
• On-failure destination
(SQS or SNS)
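The settings from the Lambda slides could be collected as below. The keys follow the parameter names of Lambda's CreateEventSourceMapping API. The helper itself, its default values and the ARNs in the usage line are hypothetical, and no AWS call is made.

```python
def kinesis_event_source_params(stream_arn, function_name, on_failure_arn, *,
                                batch_size=100, batch_window_s=0,
                                parallelization_factor=1,
                                max_retries=3, max_record_age_s=3600):
    """Build event source mapping settings, checked against the slide limits."""
    if not 1 <= batch_size <= 10_000:
        raise ValueError("batch size must be 1..10 000 records")
    if not 0 <= batch_window_s <= 300:
        raise ValueError("batch window must be 0..300 seconds (5 min)")
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be 1..10")
    if not 0 <= max_retries <= 10_000:
        raise ValueError("maximum retry attempts must be 0..10 000")
    if not 60 <= max_record_age_s <= 7 * 24 * 3600:
        raise ValueError("maximum record age must be 1 min..7 days")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": batch_size,
        "MaximumBatchingWindowInSeconds": batch_window_s,
        "ParallelizationFactor": parallelization_factor,
        "MaximumRetryAttempts": max_retries,
        "MaximumRecordAgeInSeconds": max_record_age_s,
        "BisectBatchOnFunctionError": True,
        # SQS queue or SNS topic that receives details of failed batches.
        "DestinationConfig": {"OnFailure": {"Destination": on_failure_arn}},
    }


params = kinesis_event_source_params(
    "arn:aws:kinesis:eu-west-1:123456789012:stream/example-stream",
    "example-function",
    "arn:aws:sqs:eu-west-1:123456789012:example-dlq",
    parallelization_factor=10,
)
```

With these limits in place a poison pill record no longer blocks the shard until the data expires.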
Agenda
Load the data to the datalake in a
columnar format
Enable content personalization
through near real-time analytics
• Fully managed service to load streaming
data into a data lake
• S3, Redshift, Amazon Elasticsearch Service
• HTTP endpoints (New!)
• Datadog, New Relic, MongoDB, and Splunk (Newish!)
• Loads streaming data with 0 lines of code
• Scales automatically (no shards to manage)
• Can batch, compress, transform and convert
data before loading to the destination
Kinesis Firehose
• Data stored up to 24 hours
• Batches records up to a certain size or for a certain
period of time, whichever comes first
• 1 to 128 MB
• 60 to 900 seconds
• Uses Glue Data Catalog to convert JSON to
• Apache Parquet
• Apache ORC
Kinesis Firehose
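A simplified model of the buffering rule above: a batch is delivered when either the size or the time threshold is reached first. `should_flush` is a hypothetical helper, and treating MB as MiB is an assumption.

```python
def should_flush(buffered_bytes: int, seconds_since_flush: float,
                 size_mb: int = 128, interval_s: int = 900) -> bool:
    """True when either buffering threshold has been reached."""
    if not 1 <= size_mb <= 128:
        raise ValueError("buffer size must be 1..128 MB")
    if not 60 <= interval_s <= 900:
        raise ValueError("buffer interval must be 60..900 seconds")
    size_reached = buffered_bytes >= size_mb * 1024 * 1024
    time_reached = seconds_since_flush >= interval_s
    return size_reached or time_reached
```

Low traffic is therefore flushed by time and high traffic by size, which determines how many objects land in the destination.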
Kinesis Streams vs. Firehose
Kinesis Streams:
• Fully managed service to stream data
• Data available up to 7 days
• Scaling using shards
• Custom stream processing with consumers
Firehose:
• Fully managed service to load data into a data lake
• Data available for 24 hours
• Scales automatically
• Batching, compressing, converting data out of the box
+ custom transformations with
Amazon Kinesis
Agent
• Stand-alone Java
application to
stream data from
files
Service
Integrations
• Kinesis Streams
• CloudWatch Logs
• CloudWatch Events
• AWS IoT
Amazon Kinesis API
(AWS SDK)
• Most flexible
• Allows full control
over writing data
Kinesis Firehose, Writing Data
putRecordBatch(params, callback)
• Request failure
• Retries by default up to 3 times
• Uses exponential backoff
• Base delay by default is 100 ms
Kinesis Firehose, AWS SDK
Load the data to the datalake in a
columnar format
Enable content personalization
through near real-time analytics
Agenda
• Fully managed service to run SQL queries
on the streaming data
• Join, filter and aggregate data over a time-based or a row-based window
Kinesis Data Analytics
Ingests data from:
Kinesis Data Stream
Kinesis Firehose
Sends results to:
Kinesis Data Stream
Kinesis Firehose
Lambda function
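To illustrate what an aggregate over a time-based window computes, here is a plain Python sketch of a tumbling window count. It is not Kinesis Data Analytics SQL. The `(timestamp, key)` event shape and the 60-second window are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_counts(events: Iterable[Tuple[float, str]],
                    window_s: int = 60) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, key) in fixed, non-overlapping windows."""
    counts: Counter = Counter()
    for timestamp, key in events:
        window_start = int(timestamp // window_s) * window_s
        counts[(window_start, key)] += 1
    return dict(counts)
```

For example, counting "start" events per programme per minute would yield one row per programme and window.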
Gotchas and
Lessons Learned
putRecords(params, callback)
Partial failure:
• Exponential backoff + jitter
Gotcha!
Writing to Kinesis Streams
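A sketch of handling partial failures: a putRecords response can report success for the request while individual records carry an error code, so only those records are resent, with exponential backoff and jitter between attempts. `client` is assumed to expose a boto3-style `put_records` method. The function name, the defaults and the injectable `sleep` and `rng` parameters are illustrative.

```python
import random
import time


def put_records_with_retry(client, stream_name, records,
                           max_retries=3, base_s=0.1,
                           sleep=time.sleep, rng=random.random):
    """Send records and resend only the failed ones.

    Returns the records that still failed after all retries.
    """
    pending = list(records)
    for attempt in range(max_retries + 1):
        response = client.put_records(StreamName=stream_name, Records=pending)
        if response.get("FailedRecordCount", 0) == 0:
            return []
        # Response entries are in the same order as the request entries;
        # failed ones carry an ErrorCode instead of a SequenceNumber.
        pending = [record
                   for record, result in zip(pending, response["Records"])
                   if "ErrorCode" in result]
        if attempt < max_retries:
            sleep(rng() * base_s * (2 ** attempt))  # full jitter
    return pending
```

Anything returned by the function should be logged or routed elsewhere, otherwise those events are silently lost.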
• Kinesis limits are per second, CloudWatch metrics are per minute
• 1 MB/sec or 1 000 records/sec
• Can 5 000 records/minute exceed the throughput?
• Beware of network latency!
• can be one reason for bursts in Kinesis
• avoid the external network by using a Kinesis VPC endpoint
Gotcha!
Writing to Kinesis Streams
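The question on the slide can be checked with a small calculation: 5 000 records in a minute stay well below 1 000 records/sec when spread evenly, but exceed it if they all arrive within one second. The timestamps are synthetic and `peak_per_second` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def peak_per_second(timestamps: Iterable[float]) -> int:
    """Largest number of records that fall into any single second."""
    per_second = Counter(int(ts) for ts in timestamps)
    return max(per_second.values(), default=0)


# 5 000 records spread evenly over one minute.
even = [i * 60 / 5000 for i in range(5000)]
# The same 5 000 records arriving within the first second.
burst = [i / 5000 for i in range(5000)]

even_peak = peak_per_second(even)
burst_peak = peak_per_second(burst)
```

Both cases show up as the same per-minute sum in CloudWatch, yet only the burst is throttled on a single shard.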
• IncomingRecords = PutRecord + PutRecords
The number of records successfully put to the Kinesis Stream
• WriteProvisionedThroughputExceeded = PutRecord + PutRecords
The number of records rejected due to throttling
• IncomingRecords + WriteProvisionedThroughputExceeded = Total
amount of incoming records
Gotcha!
Writing to Kinesis Streams
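Following the relation above, the share of throttled records can be derived from the two metrics. The function name is hypothetical. The inputs are sums of both metrics over the same period.

```python
def throttled_share(incoming_records: int, throttled_records: int) -> float:
    """Fraction of all incoming records that were rejected due to throttling."""
    total = incoming_records + throttled_records
    return throttled_records / total if total else 0.0
```

A value above zero means producers must retry or data is being dropped.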
• IteratorAge
• latency between when a record is added, and when it is processed
• If it’s increasing, increase the number of shards, or
• increase the parallelization factor (NEWish)
• Two different iterator age metrics
• Kinesis stream iterator age is a combination metric across all consumers
• not too informative
• Lambda’s own iterator age should be used instead!
Gotcha!
Reading from Kinesis Streams
• Beware of timeouts!
• connectTimeout: timeout for establishing a new connection on a
socket
• if not explicitly set, this value will default to the value of timeout
• timeout: read timeout for an existing socket (2 min)
• time between when request ends and the response is
received, including service and network round-trips
Gotcha!
• Firehose scales endlessly!
Or does it?
• “It is a fully managed service that automatically scales to match the
throughput of your data.”
• “When Direct PUT is configured as the data source, each Kinesis Data
Firehose delivery stream is subject to the following limits: […]
5,000 records/second, 2,000 transactions/second, and 5 MiB/second.”
• ThrottledRecords: the number of records that were throttled because data
ingestion exceeded one of the delivery stream limits.
Gotcha!
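A quick check of a workload against the Direct PUT limits quoted above. The limit values are taken from the slide and the actual quotas for an account and region may differ. The workload figures in the example (prime-time rate, 500 records per request, about 1 000 bytes per record) are assumptions.

```python
# Direct PUT limits as quoted on the slide; verify the current quotas.
DIRECT_PUT_LIMITS = {
    "records_per_sec": 5_000,
    "requests_per_sec": 2_000,
    "bytes_per_sec": 5 * 1024 * 1024,  # 5 MiB
}


def exceeded_limits(records_per_sec, requests_per_sec, bytes_per_sec,
                    limits=DIRECT_PUT_LIMITS):
    """Names of the limits the observed workload goes over."""
    observed = {
        "records_per_sec": records_per_sec,
        "requests_per_sec": requests_per_sec,
        "bytes_per_sec": bytes_per_sec,
    }
    return [name for name, limit in limits.items() if observed[name] > limit]


# ~600 000 rpm is 10 000 records/sec, sent 500 per request.
prime_time = exceeded_limits(10_000, 10_000 / 500, 10_000 * 1_000)
```

Batching keeps the request rate low, but the record and byte limits are still exceeded, which is when ThrottledRecords starts to grow.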
• Always learn about the service limits
• (there are always limits)
• hard and soft limits
• Keep a close eye on Lambda’s
concurrency limits
• Deep dive into the error handling
• Don’t just assume things ..
• If not sure, ask AWS Support
• Keep a close eye on service updates
• Everything fails all the time, especially at scale, so be prepared and
fail fast
“Everything fails,
all the time”
Dr. Werner Vogels
(CTO, Amazon)
Lessons Learned
Thank you!
ANAHIT POGOSOVA
@anahit_fi
Shameless Plug
Real World Serverless with Yan Cui, @theburningmonk
episode #14
Mastering AWS Kinesis Data Streams, Part 1 (2)
dev.solita.fi
@anahit_fi
Anahit Pogosova
AWS Community Nordics Virtual Meetup
