Fernando Rodriguez Olivera
@frodriguez
Buenos Aires, Argentina, Dec 2015
Amazon Kinesis
AWS User Group Argentina
Twitter: @frodriguez
Professor at Universidad Austral (Distributed Systems, Compiler Design, Operating Systems, …)
Creator of mvnrepository.com
Organizer at Buenos Aires High Scalability Group
Fernando Rodriguez Olivera
Amazon Kinesis Streams
High-throughput, low-latency service for real-time data processing over large, distributed data streams
Kinesis Streams
[Diagram: Producers → Kinesis Stream → App #1, App #2]
Data retention: from 24 to 168 hrs
Designed for < 1 sec latency
Shards
[Diagram: Producers → Kinesis Endpoints → Kinesis Stream (Shard 1, Shard 2, Shard 3) → App #1, App #2; records tagged PK1, PK3, PK6, PK7, PK9 are spread across the shards]
Records annotated with the same Partition Key (PK) are stored in the same shard
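A minimal sketch of how a partition key could be mapped to a shard. Per the Kinesis documentation, the service takes the MD5 hash of the partition key as a 128-bit integer and picks the shard whose hash key range contains it. The class name, the method names, and the assumption that the shards split the hash space into equal contiguous ranges are illustrative only and do not come from the slides.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the AWS SDK.
final class PartitionKeyRouting {
    // Size of the 128-bit hash key space: 2^128.
    static final BigInteger HASH_SPACE = BigInteger.ONE.shiftLeft(128);

    // MD5 of the UTF-8 partition key, read as an unsigned 128-bit integer.
    static BigInteger hashKey(String partitionKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumption: shardCount shards with equal, contiguous hash key ranges.
    static int shardFor(String partitionKey, int shardCount) {
        BigInteger rangeSize = HASH_SPACE.divide(BigInteger.valueOf(shardCount));
        int index = hashKey(partitionKey).divide(rangeSize).intValue();
        // The last shard absorbs the remainder of the integer division.
        return Math.min(index, shardCount - 1);
    }

    public static void main(String[] args) {
        for (String pk : new String[] {"PK1", "PK1", "PK7", "PK9", "PK9"}) {
            System.out.println(pk + " -> shard index " + shardFor(pk, 3));
        }
    }
}
```

Because the mapping depends only on the key, repeated keys always land on the same shard index, which is what preserves per-key ordering.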
Shard Capacity (per shard)
                        per second    per hour       per day
New Records (writes)    1 MB/s        3.6 GB/h       86.4 GB/d
                        1K put/s      3.6 M put/h    86.4 M put/d
Get Records (reads)     2 MB/s        7.2 GB/h       172.8 GB/d
                        5 tx/s        18k tx/h       432k tx/d
Stored data: max 86.4 GB with 24h retention, max 604.8 GB with 168h retention
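The hourly, daily, and retention figures follow directly from the per-second limits. The sketch below reproduces that arithmetic and adds a rough shard-count estimate for a given write load. The class and method names are hypothetical, and decimal units (1 MB = 1,000,000 bytes) are assumed because that is what the slide's totals imply.

```java
// Hypothetical helper for back-of-the-envelope shard sizing.
final class ShardCapacity {
    static final long WRITE_BYTES_PER_SEC = 1_000_000L;   // 1 MB/s per shard
    static final long WRITE_RECORDS_PER_SEC = 1_000L;     // 1K put/s per shard
    static final long READ_BYTES_PER_SEC = 2_000_000L;    // 2 MB/s per shard
    static final long READ_TX_PER_SEC = 5L;               // 5 tx/s per shard

    // Scales a per-second rate to a number of hours.
    static long perHours(long perSecond, long hours) {
        return perSecond * 3600L * hours;
    }

    // Shards needed for a write load: both the byte and record limits must hold.
    static long shardsForWrites(long bytesPerSec, long recordsPerSec) {
        long byBytes = (bytesPerSec + WRITE_BYTES_PER_SEC - 1) / WRITE_BYTES_PER_SEC;
        long byRecords = (recordsPerSec + WRITE_RECORDS_PER_SEC - 1) / WRITE_RECORDS_PER_SEC;
        return Math.max(1L, Math.max(byBytes, byRecords));
    }

    public static void main(String[] args) {
        System.out.println("write bytes/day:  " + perHours(WRITE_BYTES_PER_SEC, 24));
        System.out.println("stored max, 168h: " + perHours(WRITE_BYTES_PER_SEC, 168));
        System.out.println("read bytes/day:   " + perHours(READ_BYTES_PER_SEC, 24));
        System.out.println("read tx/day:      " + perHours(READ_TX_PER_SEC, 24));
        System.out.println("shards for 2.5 MB/s, 1500 put/s: "
                + shardsForWrites(2_500_000L, 1_500L));
    }
}
```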
Shard Pricing
                                          per shard-hour   per shard-month
24h Retention                             $0.015/hr        $11/month
Extended Retention                        $0.020/hr        $14.6/month
Up to 168h Retention (24h + Extended)     $0.035/hr        $25.6/month
+ $0.014 per 1,000,000 PUT Payload Units (1 unit = 25KB)
Max Record Size = 1MB
* Prices for us-east
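A rough cost estimate can be derived from the prices on this slide. The sketch below is only an illustration: the class and method names are hypothetical, a month is assumed to have 730 hours (which matches the $14.6/month figure), and a PUT Payload Unit is assumed to be 25 × 1024 bytes. Prices are the Dec 2015 us-east values shown above and may have changed since.

```java
// Hypothetical cost estimator based on the slide's prices.
final class ShardPricing {
    static final double BASE_PER_SHARD_HOUR = 0.015;      // 24h retention
    static final double EXTENDED_PER_SHARD_HOUR = 0.020;  // extended retention
    static final double PER_MILLION_PUT_UNITS = 0.014;
    static final int HOURS_PER_MONTH = 730;               // assumption
    static final int PUT_UNIT_BYTES = 25 * 1024;          // assumption: 1 KB = 1024 bytes

    // Each record is billed in 25KB chunks, rounded up, with a minimum of one unit.
    static long putPayloadUnits(long recordBytes) {
        return Math.max(1L, (recordBytes + PUT_UNIT_BYTES - 1) / PUT_UNIT_BYTES);
    }

    // Monthly cost in USD for a stream with uniformly sized records.
    static double monthlyCost(int shards, boolean extendedRetention,
                              long recordsPerMonth, long recordBytes) {
        double hourly = BASE_PER_SHARD_HOUR
                + (extendedRetention ? EXTENDED_PER_SHARD_HOUR : 0.0);
        double shardCost = shards * hourly * HOURS_PER_MONTH;
        double putCost = recordsPerMonth * putPayloadUnits(recordBytes)
                / 1_000_000.0 * PER_MILLION_PUT_UNITS;
        return shardCost + putCost;
    }

    public static void main(String[] args) {
        System.out.printf("1 shard, 24h retention, no traffic: $%.2f%n",
                monthlyCost(1, false, 0L, 1L));
        System.out.printf("1 shard, extended, 100M 1KB records: $%.2f%n",
                monthlyCost(1, true, 100_000_000L, 1_000L));
    }
}
```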
Kinesis from AWS CLI
aws kinesis create-stream --stream-name myStream --shard-count 1
aws kinesis list-streams
{
    "StreamNames": [
        "myStream"
    ]
}
aws kinesis put-record --stream-name myStream --partition-key 123 --data "my data"
Collecting Records from SDK
kinesis = new AmazonKinesisClient(…);
result = kinesis.putRecord(new PutRecordRequest()
    .withStreamName("myStream")
    .withPartitionKey("partitionKey")
    .withData(bytes));
or
kinesis = new AmazonKinesisAsyncClient(…);
future = kinesis.putRecordAsync(new PutRecordRequest()
    .withStreamName("myStream")
    .withPartitionKey("partitionKey")
    .withData(bytes));
Collecting Records (Batch)
kinesis = new AmazonKinesisClient(…);
...
records.add(new PutRecordsRequestEntry()
    .withPartitionKey("partitionKey")
    .withData(bytes));
records.add(…);
results = kinesis.putRecords(new PutRecordsRequest()
    .withStreamName("myStream")
    .withRecords(records));
KPL (Kinesis Producer Library)
[Diagram: records pass through buffering, aggregation, and collection w/PutRequests]
Collecting with KPL
config = new KinesisProducerConfiguration()
    .setRecordMaxBufferedTime(200) // millis
    .setMaxConnections(4)
    .setRequestTimeout(60000)
    .setRegion("us-east-1");
producer = new KinesisProducer(config);
producer.addUserRecord("myStream", "partitionKey1", bytes1);
producer.addUserRecord("myStream", "partitionKey2", bytes2);
Consumer APIs
High-level API (KCL = Kinesis Client Library)
Low-level API (with shard iterators)
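Low-Level API with Shard Iterators
Shard iterator types:
    TRIM_HORIZON             all records in the last 24 hrs (from the oldest retained record)
    LATEST                   new records only
    AT_SEQUENCE_NUMBER       records starting at a given sequence number
    AFTER_SEQUENCE_NUMBER    records starting right after a given sequence number
Get Records: max 5 read transactions per second per shard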
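The four iterator types differ only in where reading starts. The sketch below models a single shard as an in-memory list to show those starting positions. It is a toy model, not the Kinesis API: the class, its methods, and the use of list indexes as sequence numbers are all assumptions made for illustration (real sequence numbers are opaque strings and real iterators are opaque tokens).

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of one shard; hypothetical, not the AWS SDK.
final class ShardIteratorModel {
    enum IteratorType { AT_SEQUENCE_NUMBER, AFTER_SEQUENCE_NUMBER, TRIM_HORIZON, LATEST }

    private final List<String> records = new ArrayList<>();

    // Appends a record and returns its sequence number (here: its list index).
    int put(String data) {
        records.add(data);
        return records.size() - 1;
    }

    // Returns the read position for an iterator type.
    // sequenceNumber is ignored for TRIM_HORIZON and LATEST.
    int iterator(IteratorType type, int sequenceNumber) {
        switch (type) {
            case AT_SEQUENCE_NUMBER:    return sequenceNumber;
            case AFTER_SEQUENCE_NUMBER: return sequenceNumber + 1;
            case TRIM_HORIZON:          return 0;               // oldest retained record
            case LATEST:                return records.size();  // only records added later
            default: throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    // Reads up to limit records from position; the next position is
    // position + (number of records returned).
    List<String> getRecords(int position, int limit) {
        int end = Math.min(records.size(), position + limit);
        int start = Math.min(position, end);
        return new ArrayList<>(records.subList(start, end));
    }

    public static void main(String[] args) {
        ShardIteratorModel shard = new ShardIteratorModel();
        shard.put("a");
        int seqB = shard.put("b");
        shard.put("c");
        for (IteratorType type : IteratorType.values()) {
            int pos = shard.iterator(type, seqB);
            System.out.println(type + " -> " + shard.getRecords(pos, 10));
        }
    }
}
```

In the real service, a consumer would also have to pace its reads to stay within the limit of 5 read transactions per second per shard.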
Kinesis from AWS CLI
aws kinesis describe-stream --stream-name myStream
{
    "StreamDescription": {
        "StreamStatus": "ACTIVE",
        "StreamName": "myStream",
        "StreamARN": "arn:aws:kinesis:…:stream/myStream",
        "Shards": [
            {
                "ShardId": "shardId-000000000000",
                "HashKeyRange": {
                    "EndingHashKey": "…",
                    "StartingHashKey": "…"
                },
                "SequenceNumberRange": {
                    "StartingSequenceNumber": "…"
                }
            }
        ]
    }
}
Kinesis from AWS CLI
aws kinesis get-shard-iterator --stream-name myStream --shard-id shardId-000000000000 --shard-iterator-type TRIM_HORIZON
{
    "ShardIterator": "… iterator id …"
}
aws kinesis get-records --shard-iterator "… iterator id …"
{
    "Records": [ {
        "Data": "...",
        "PartitionKey": "...",
        "SequenceNumber": "..."
    } ],
    "MillisBehindLatest": 1000,
    "NextShardIterator": "… new iterator id …"
}
Splitting/Merging Shards
[Diagram: parent Shard (CLOSED) split into two child Shards (OPEN)]
Old records remain at the parent; new events are added to the children.
After 24 hrs, the parent's state changes from CLOSED to EXPIRED.
GetRecords consumes from the parent using 1 shard iterator until the split is detected.
Then 2 iterators are required to consume from the children.
Consuming Records with KCL
KCL (Kinesis Client Library)
[Diagram: stream with 3 shards → app w/2 consumers; KCL runs on machine01 and machine02, with one Record Processor per shard]
Shard processing is balanced across nodes
If a node fails, its shards are re-assigned to the remaining nodes
KCL Coordination w/DynamoDB
[Diagram: app w/2 consumer nodes (machine01, machine02), each running KCL with its Record Processors, sharing a DynamoDB table, TableName=AppID]
lease key    checkpoint    lease counter    lease owner
shard01      …             123              machine01
shard02      …             234              machine01
shard03      …             345              machine02
The lease counter is continuously incremented (as a heartbeat)
The App ID is used as the table name. Leases are updated in DynamoDB with conditional updates
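The lease scheme relies on conditional updates: a write succeeds only if the lease counter still has the value the writer last saw. The sketch below models that idea with an in-memory map. It is not KCL's implementation and does not use the DynamoDB API; the class, method names, and the omission of the checkpoint column are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy in-memory stand-in for the lease table; hypothetical, not KCL code.
final class LeaseTable {
    static final class Lease {
        final String owner;
        final long counter;
        Lease(String owner, long counter) {
            this.owner = owner;
            this.counter = counter;
        }
    }

    private final Map<String, Lease> rows = new HashMap<>();

    synchronized void create(String shard, String owner, long counter) {
        rows.put(shard, new Lease(owner, counter));
    }

    synchronized Lease get(String shard) {
        return rows.get(shard);
    }

    // Conditional update: applies only if the stored counter equals expectedCounter.
    // On success the counter is incremented and the owner is set to newOwner.
    synchronized boolean updateIfCounterIs(String shard, long expectedCounter, String newOwner) {
        Lease current = rows.get(shard);
        if (current == null || current.counter != expectedCounter) {
            return false; // someone else updated the lease in the meantime
        }
        rows.put(shard, new Lease(newOwner, expectedCounter + 1));
        return true;
    }

    public static void main(String[] args) {
        LeaseTable table = new LeaseTable();
        table.create("shard01", "machine01", 123);

        // Heartbeat: the owner renews its own lease by bumping the counter.
        System.out.println("heartbeat ok: "
                + table.updateIfCounterIs("shard01", 123, "machine01"));

        // A takeover based on a stale counter is rejected.
        System.out.println("stale takeover ok: "
                + table.updateIfCounterIs("shard01", 123, "machine02"));

        // If the counter stops changing, another node can take the lease.
        System.out.println("takeover ok: "
                + table.updateIfCounterIs("shard01", 124, "machine02"));
        System.out.println("owner now: " + table.get("shard01").owner);
    }
}
```

A node that observes an unchanged counter for long enough can treat the owner as failed and attempt the takeover; the conditional write ensures that only one contender wins.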
Consuming Records (KCL)
class MyProcessor implements IRecordProcessor {
    // initialize(…) and shutdown(…) omitted
    public void processRecords(
            List<Record> records,
            IRecordProcessorCheckpointer checkpointer)
    {
        for (Record record : records) {
            // Process record …
        }
        // Record progress for this shard (exception handling omitted)
        checkpointer.checkpoint();
    }
}
* KCL available for: Java, Node.js, .NET, Python, Ruby
Thanks,
Fernando Rodriguez Olivera
@frodriguez
frodriguez <at> gmail.com
