
Amazon Kinesis

AWS Technical Day - Amazon Kinesis



  1. Deep Dive – Amazon Kinesis. Guy Ernest, Solution Architect, Amazon Web Services (@guyernest)
  2. Motivation for Real-Time Analytics. (Diagram: the move from batch E*/BI workloads toward real-time (RT) and machine learning (ML) analytics.)
  3. Amazon Kinesis: a managed service for real-time big data processing. Create streams to produce and consume data; elastically add and remove shards for performance; process data with the Kinesis Client Library, AWS Lambda, Apache Spark, or Apache Storm; integrate with S3, Redshift, and DynamoDB. (Diagram: Kinesis within the AWS analytics tier, alongside compute, storage, database, app services, networking, and deployment & administration.)
  4. Amazon Kinesis dataflow. (Diagram: data sources PUT into a stream of shards 1..N replicated across Availability Zones; consuming applications include App.1 (aggregate & de-duplicate), App.2 (metric extraction), App.3 (sliding-window analysis), and App.4 (machine learning), writing to S3, DynamoDB, and Redshift.)
  5. Example architecture – metering. (Diagram: metering events flow into incremental bill computation, an archive, the billing management service, and billing auditors.)
  6. Amazon Kinesis components. Streams: named event streams of data; all data is stored for 24 hours. Shards: you scale Kinesis streams by adding or removing shards; each shard ingests up to 1 MB/sec of data and up to 1,000 TPS. Partition key: the identifier used for ordered delivery and partitioning of data across shards. Sequence number: the number of an event as assigned by Kinesis.
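As a concrete illustration of these components, here is a minimal boto3 sketch (the stream name, region, and shard count are arbitrary examples) that creates a stream and inspects the hash-key range each shard owns:

    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-west-1")

    # Create a stream with two shards; each shard accepts up to
    # 1 MB/sec of data and 1,000 PUT transactions/sec.
    kinesis.create_stream(StreamName="clickstream", ShardCount=2)
    kinesis.get_waiter("stream_exists").wait(StreamName="clickstream")

    # A record's partition key is hashed into one of these ranges
    # to pick the shard that will store it.
    description = kinesis.describe_stream(StreamName="clickstream")
    for shard in description["StreamDescription"]["Shards"]:
        print(shard["ShardId"], shard["HashKeyRange"])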
  7. Getting Data In
  8. Kinesis – ingesting fast-moving data. Producers use a PUT call to store data in a stream. A partition key is used to distribute the PUTs across shards, and a unique sequence number is returned to the producer for each event. Data can be ingested at 1 MB/second or 1,000 transactions/second per shard, with events of up to 1 MB each.
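A minimal boto3 sketch of this PUT path (stream name and payload are illustrative); note the per-shard sequence number in the response:

    import json

    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-west-1")

    event = {"user": "u-1234", "action": "click"}

    # The partition key ("u-1234" here) is hashed to choose the shard;
    # Kinesis acknowledges with the shard ID and a sequence number.
    response = kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user"],
    )
    print(response["ShardId"], response["SequenceNumber"])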
  9. Introducing the Kinesis Producer Library (KPL): a native code module (C++/Boost) that performs efficient writes to multiple Kinesis streams, with asynchronous execution and configurable aggregation of events. (Diagram: the application hands PutRecord(s) calls to the KPL daemon, which writes asynchronously to one or more Kinesis streams.)
  10. KPL aggregation. Many small events (e.g. 20 KB–500 KB) are packed into a single record of up to the 1 MB maximum event size, framed by a Protobuf header and footer. (Diagram: the KPL daemon aggregating small events into one large record before the asynchronous PUT.)
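To make the aggregation idea concrete, here is a small Python sketch. The real KPL frames aggregated events in a Protobuf envelope (with a matching de-aggregator on the consumer side); the naive length-prefixed framing below is only a stand-in for showing how batching reduces the PUT count:

    import struct

    MAX_RECORD = 1024 * 1024  # the 1 MB Kinesis record limit

    def aggregate(events):
        """Pack many small events into as few <=1 MB blobs as possible."""
        batch, size, blobs = [], 0, []
        for event in events:
            framed = struct.pack(">I", len(event)) + event  # length prefix
            if size + len(framed) > MAX_RECORD and batch:
                blobs.append(b"".join(batch))
                batch, size = [], 0
            batch.append(framed)
            size += len(framed)
        if batch:
            blobs.append(b"".join(batch))
        return blobs

    # 1,000 x 20 KB events collapse into ~20 PUTs instead of 1,000.
    print(len(aggregate([b"x" * 20_000] * 1000)))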
  11. Kinesis ecosystem – ingest. Apache Flume source and sink: https://github.com/pdeyhim/flume-kinesis. FluentD with dynamic partitioning support: https://github.com/awslabs/aws-fluent-plugin-kinesis. Log4J and Log4Net appenders are included in the Kinesis samples.
  12. Best practices for the partition key (see the sketch below): • A random key will give an even distribution. • If events should be processed together, choose a relevant high-cardinality partition key and monitor the shard distribution. • If partial ordering is important, use the sequence number.
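A small sketch of those two key strategies (the event field names are hypothetical):

    import uuid

    def partition_key(event):
        if "session_id" in event:
            # High-cardinality business key: all events for one session
            # hash to the same shard, so their relative order is kept.
            return event["session_id"]
        # No ordering requirement: a random key spreads load evenly.
        return uuid.uuid4().hex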
  13. Getting Data Out
  14. Kinesis Client Library (KCL). Libraries are available for Java, Ruby, Node, and Go, plus a multi-language implementation with native Python support. All state management is in DynamoDB.
  15. Consuming data – Kinesis-enabled applications. The Kinesis Client Library (KCL) is a client library for fault-tolerant, at-least-once, real-time processing. It simplifies reading from the stream by abstracting your code from individual shards: it automatically starts a worker thread for each shard, increases and decreases the thread count as the number of shards changes, uses checkpoints to keep track of a thread's location in the stream, and restarts threads and workers if they fail.
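A minimal record processor for the multi-language KCL daemon mentioned above, sketched with the amazon_kclpy package (this follows the v1-style interface; exact method signatures vary between KCL versions):

    import base64

    from amazon_kclpy import kcl

    class RecordProcessor(kcl.RecordProcessorBase):
        # One instance runs per shard; the KCL daemon handles shard
        # assignment, lease stealing, and worker restarts.

        def initialize(self, shard_id):
            self.shard_id = shard_id

        def process_records(self, records, checkpointer):
            for record in records:
                data = base64.b64decode(record.get("data"))
                print(self.shard_id, data)
            # Checkpoint to DynamoDB so a restarted worker resumes here.
            checkpointer.checkpoint()

        def shutdown(self, checkpointer, reason):
            if reason == "TERMINATE":  # the shard was split or merged
                checkpointer.checkpoint()

    if __name__ == "__main__":
        kcl.KCLProcess(RecordProcessor()).run()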
  16. Kinesis Connectors – analytics tooling integration (github.com/awslabs/amazon-kinesis-connectors). S3: batch-write files for archive into S3, with sequence-based file naming. Redshift: once written to S3, load to Redshift, with manifest support and user-defined transformers. DynamoDB: BatchPut append to a table, with user-defined transformers. Elasticsearch: automatically index the stream.
  17. Connectors architecture (diagram).
  18. Kinesis ecosystem – Storm. The Apache Storm Kinesis spout, with automatic checkpointing via ZooKeeper: https://github.com/awslabs/kinesis-storm-spout
  19. Kinesis ecosystem – Spark. An Apache Spark DStream receiver runs the KCL, with one DStream per shard, checkpointed via the KCL. Spark is natively available on EMR (EMRFS overlay on HDFS, AMI 3.8.0): https://aws.amazon.com/elasticmapreduce/details/spark
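For reference, that receiver is exposed in PySpark roughly as follows (a sketch: Python support for the Kinesis receiver arrived in later Spark releases than the AMI above, and the job must be submitted with the spark-streaming-kinesis-asl package):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kinesis import InitialPositionInStream, KinesisUtils

    sc = SparkContext(appName="kinesis-counter")
    ssc = StreamingContext(sc, batchDuration=10)

    # Under the hood the receiver runs the KCL, so checkpoints (keyed by
    # the application name) live in a DynamoDB lease table.
    stream = KinesisUtils.createStream(
        ssc, "kinesis-counter", "clickstream",
        "https://kinesis.eu-west-1.amazonaws.com", "eu-west-1",
        InitialPositionInStream.LATEST, checkpointInterval=10)

    stream.count().pprint()
    ssc.start()
    ssc.awaitTermination()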
  20. Consuming data – AWS Lambda: a distributed event processing platform. Stateless JavaScript and Java functions run against an event stream, with the AWS SDK built in. You configure RAM and an execution timeout, and functions are automatically invoked against a shard. Community libraries exist for Python and Go. Functions have access to the underlying filesystem for read/write and can call other Lambda functions.
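A sketch of a Kinesis-triggered Lambda handler, shown in Python for consistency with the other examples (the natively supported runtimes at the time of this talk were JavaScript and Java):

    import base64
    import json

    def handler(event, context):
        # Lambda invokes the function with a batch of records from one
        # shard; the record data arrives base64-encoded.
        for record in event["Records"]:
            payload = base64.b64decode(record["kinesis"]["data"])
            print(record["kinesis"]["sequenceNumber"], json.loads(payload))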
  21. Why Kinesis? Durability. Kinesis is a regional service with synchronous writes to multiple AZs and extremely high durability. Alternatives may be in-memory for performance, require you to understand disk-sync semantics, and rely on user-managed replication, where replication lag determines your RPO.
  22. Why Kinesis? Performance. Kinesis performs continual processing on streaming big data; processing latencies fall to under one second, compared with the minutes or hours associated with batch processing. For alternatives, sub-second processing latencies depend on CPU and disk performance, and a cluster interruption means a processing outage.
  23. Why Kinesis? Availability. Kinesis is a regional service with synchronous writes to multiple AZs and extremely high durability; AZ, networking, and chain-server issues are transparent to producers and consumers. Many alternatives depend on a CP database, where a lost quorum can result in failure or inconsistency of the cluster, and the highest achievable availability is bounded by the availability of the cross-AZ links or of an AZ.
  24. Why Kinesis? Operations. Kinesis is a managed service for real-time streaming data collection, processing, and analysis: simply create a new stream, set the desired level of capacity, and let the service handle the rest. Alternatives require you to build instances, install software, operate the cluster, manage disk space, manage replication, and migrate to a new stream on scale-up.
  25. Why Kinesis? Elasticity. Kinesis seamlessly scales to match your data throughput rate and volume; you can easily scale up to gigabytes per second, and the service will scale up or down based on your operational or business needs. Alternatives fix the partition count up front, with maximum performance of roughly one partition per core or machine, and scaling means converting from one stream to another and reconfiguring the application.
  26. Scaling streams: https://github.com/awslabs/amazon-kinesis-scaling-utils
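That utility automates resharding; underneath, scaling a stream up comes down to splitting shards. A minimal boto3 sketch that splits one shard at the midpoint of its hash-key range:

    import boto3

    kinesis = boto3.client("kinesis", region_name="eu-west-1")

    shard = kinesis.describe_stream(StreamName="clickstream")[
        "StreamDescription"]["Shards"][0]
    low = int(shard["HashKeyRange"]["StartingHashKey"])
    high = int(shard["HashKeyRange"]["EndingHashKey"])

    # Splitting at the midpoint doubles this shard's capacity:
    # two shards, each taking 1 MB/sec and 1,000 TPS.
    kinesis.split_shard(
        StreamName="clickstream",
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str((low + high) // 2),
    )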
  27. Why Kinesis? Cost. Kinesis is cost-efficient for workloads of any scale: get started by provisioning a small stream and pay low hourly rates only for what you use, scaling up and down dynamically at $0.015 per shard-hour. The alternative is to run your own EC2 instances in a multi-AZ configuration for increased durability, and to utilise instance auto-scaling on worker lag from HEAD with custom metrics.
  28. Why Kinesis? Cost. Prices dropped on 2nd June 2015 and were restructured to support the KPL. Old pricing: $0.028 per 1M records PUT. New pricing: $0.014 per 1M 25 KB "payload units". Scenario: 50,000 events/second at 512 B/event = 24.4 MB/second. Old pricing: 50 shards ($558) + 4,320M records PUT ($120.96) = $678.96. New pricing + KPL: 25 shards ($279) + 2,648M payload units ($37.50) = $316.50.
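A back-of-the-envelope check of the new-pricing column (assuming a 31-day month of 744 hours, which reproduces the shard figures; the slide's record counts appear to mix time bases, so treat this as a sketch of the calculation rather than a billing reference):

    SHARD_HOUR = 0.015         # $ per shard-hour
    PER_MILLION_UNITS = 0.014  # $ per 1M 25 KB payload units
    HOURS = 744                # 31-day month

    events_per_sec = 50_000
    throughput = events_per_sec * 512          # bytes/sec, ~24.4 MB/s

    # With KPL aggregation the 1 MB/sec shard throughput is the limit
    # (without it, 50,000 TPS / 1,000 TPS per shard would need 50 shards).
    shards = 25
    shard_cost = shards * SHARD_HOUR * HOURS   # 25 * 0.015 * 744 = $279.00

    # Aggregated records are billed in 25 KB payload units.
    units_per_sec = throughput / (25 * 1024)   # ~1,000 units/sec
    unit_cost = units_per_sec * 3600 * HOURS / 1e6 * PER_MILLION_UNITS

    print(f"shards: ${shard_cost:.2f}, payload units: ${unit_cost:.2f}")
    # shards: $279.00, payload units: $37.50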
  29. Kinesis – consumer application best practices. Tolerate failure of threads (consider data serialisation issues and lease stealing) and of hardware (Auto Scaling may add nodes as needed; the KCL will automatically redistribute workers onto new instances). Scale consumers up and down as the number of shards increases or decreases. Don't store data in memory in the workers: use an elastic data store such as DynamoDB for state. Elastic Beanstalk provides all of these best practices in a simple-to-deploy, multi-version application container, and logic implemented in Lambda doesn't require any servers at all.
  30. Managing Application State
  31. Consumer local state (anti-pattern). The consumer binds to a configured number of partitions and stores the 'state' of a data structure, as defined by the event stream, on local storage; the read API accesses that local storage as a 'shard' of the overall database. (Diagram: two consumers, each holding half of the partitions on local disk behind a read API.)
  32. Consumer local state (anti-pattern). But what happens when an instance fails?
  33. Consumer local state (anti-pattern). A new consumer process starts up for the required partitions and must read from the beginning of the stream to rebuild local storage: complex, error-prone, user-constructed software with a long startup time.
  34. External highly available state (best practice). Each consumer binds to an even share of the shards based on the number of consumers and stores its 'state' in DynamoDB, which is highly available, elastic, and durable; the read API can access DynamoDB. (Diagram: two consumers reading half of the shards each, writing state to DynamoDB.)
  35. (Diagram: the same pattern, with the read API querying DynamoDB directly rather than the consumers.)
  36. (Diagram: the same pattern, with AWS Lambda as the consumer tier writing state to DynamoDB; see the sketch below.)
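A minimal sketch of keeping per-shard consumer state in DynamoDB rather than on local disk (the table and attribute names are hypothetical; this mirrors what the KCL's lease table does for checkpoints):

    import boto3

    table = boto3.resource("dynamodb", region_name="eu-west-1").Table(
        "consumer-state")

    def save_state(shard_id, checkpoint, state):
        # Durable, highly available state: a replacement consumer (or the
        # read API) can pick this up immediately after an instance failure.
        table.put_item(Item={
            "shardId": shard_id,       # partition key of the table
            "checkpoint": checkpoint,  # last processed sequence number
            "state": state,            # the aggregate being maintained
        })

    def load_state(shard_id):
        return table.get_item(Key={"shardId": shard_id}).get("Item")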
  37. Idempotency
  38. Idempotency: the property of a system whereby repeated application of a function to a single input results in the same end state of the system … exactly-once processing.
  39. Idempotency – writing data. The Kinesis SDK and the KPL may retry a PUT in certain circumstances. A Kinesis record acknowledged with a sequence number is durable across multiple Availability Zones… but there can be a duplicate entry. (Diagram: the application's PutRecord(s) writes record 123 to storage, but the response is a 403 endpoint redirect.)
  40. (Diagram: the retried PutRecord(s) succeeds with 200 OK, leaving record 123 written to storage twice.)
  41. Coming Soon…
  42. Idempotency – rolling idempotency check. Kinesis will manage a rolling time window of record IDs in DynamoDB; record IDs are user-supplied, and duplicates in the storage tier will be acknowledged as successful. (Diagram: the application's PutRecord(s) is checked against an X-hour rolling-window table in DynamoDB before reaching storage.)
  43. (Diagram: the record ID is written to DynamoDB and acknowledged OK before the record itself is stored.)
  44. (Diagram: the record is stored, but the response to the producer is a 403 endpoint redirect.)
  45. (Diagram: the retried PutRecord(s) attempts to write the same record ID, which DynamoDB rejects with ConditionalCheckFailedException.)
  46. (Diagram: the duplicate is acknowledged to the producer with 200 OK without being written to storage again; see the sketch below.)
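The rolling check above is described as a forthcoming service feature, but the same mechanism can be sketched today with a DynamoDB conditional write: insert the record ID only if it has not been seen within the window, and treat a conditional-check failure as "duplicate, acknowledge as successful" (the table and attribute names are hypothetical):

    import time

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb", region_name="eu-west-1").Table(
        "record-ids")

    WINDOW_HOURS = 24  # the "X hour rolling window"

    def is_duplicate(record_id):
        try:
            table.put_item(
                Item={
                    "recordId": record_id,
                    # Entries can be expired once they leave the rolling
                    # window, e.g. with a TTL on this attribute.
                    "expiresAt": int(time.time()) + WINDOW_HOURS * 3600,
                },
                ConditionExpression="attribute_not_exists(recordId)",
            )
            return False  # first time seen: go on to write to storage
        except ClientError as error:
            if (error.response["Error"]["Code"]
                    == "ConditionalCheckFailedException"):
                return True  # a retry of a stored record: just acknowledge
            raise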
  47. In short: easy administration; real-time performance; high durability; high throughput; elastic; S3, Redshift, and DynamoDB integration; a large ecosystem; low cost.
