Amazon Kinesis is a platform for streaming data on AWS, offering powerful services to make it easy to load and analyze streaming data. In this session, you'll learn about how AWS customers are transitioning from batch to real-time processing using Amazon Kinesis, and how to get started. We will provide an overview of streaming data applications and introduce the Amazon Kinesis platform and its services. We will walk through a production use case to demonstrate how to ingest streaming data, prepare it, and analyze it to gain actionable insights in real time using Amazon Kinesis. We will also provide pointers to tutorials and other resources so you can quickly get started with your streaming data application.
2. What to expect from this session
• Real-Time Streaming Analytics – Why Now?
• Amazon Kinesis Overview
• New Features
• Design Patterns
• Nutrino Case Study
3. Most data is produced continuously
Mobile Apps Web Clickstream Application Logs
Metering Records IoT Sensors Smart Buildings
[Wed Oct 11 14:32:52
2000] [error] [client
127.0.0.1] client
denied by server
configuration:
/export/home/live/ap/h
tdocs/test
5. The diminishing value of data
Recent data is highly valuable
• If you act on it in time
• Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
• If you have the means to combine them
6. Processing real-time, streaming data
• Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable
What are the key requirements?
Ingest Transform Analyze React Persist
7. Streaming Data Scenarios Across Verticals
Scenarios/
Verticals
Accelerated Ingest-
Transform-Load
Continuous Metrics
Generation
Machine Learning and
Actionable Insights
Digital Ad
Tech/Marketing
Publisher, bidder data
aggregation
Advertising metrics like
coverage, yield, and
conversion
User engagement with
ads, optimized bid/buy
engines
IoT Sensor, device telemetry
data ingestion
Operational metrics and
dashboards
Device operational
intelligence and alerts
Gaming Online data aggregation,
e.g., top 10 players
Massively multiplayer
online game (MMOG) live
dashboard
Leader board generation,
player-skill match
Consumer
Online
Clickstream analytics Metrics like impressions
and page views
Recommendation engines,
proactive care
Operation
monitoring
Security
DevOps tools,
Ingesting
VPCFlowLogs
Subscribe to
CloudwatchLogs and
analyze logs in Real-
Time
Anomaly Detection
10. Amazon Kinesis
Kinesis FirehoseKinesis AnalyticsKinesis Streams
DeliverProcess
For Technical Developers
Build your own custom
applications that process
or analyze streaming data
11. Amazon Kinesis
Kinesis FirehoseKinesis AnalyticsKinesis Streams
Deliver
For Technical Developers
Build your own custom
applications that process
or analyze streaming data
For All Developers, Data
Scientists, Business Analysts
Easily analyze data streams
using standard SQL queries
12. Amazon Kinesis
Kinesis FirehoseKinesis AnalyticsKinesis Streams
For Technical Developers
Build your own custom
applications that process
or analyze streaming data
For All Developers, Data
Scientists, Business Analysts
Easily analyze data streams
using standard SQL queries
For All Developers
Easily load massive volumes of
streaming data into S3,
Redshift and Elasticsearch,
clean and format using Lambda
13. Amazon Kinesis Customer Base Diversity
1 billion events/wk from
connected devices | IoT
17 PB of game data per
season | Entertainment
80 billion ad
impressions/day, 30 ms
response time | Ad Tech
100 GB/day click streams
from 250+ sites |
Enterprise
50 billion ad
impressions/day sub-50
ms responses | Ad Tech
10 million events/day
| Retail
Ingesting 2M+ network
events every second
Funnel all
production events
through Amazon
Kinesis
14. Why are these customers choosing Amazon Kinesis?
Lower costs
Performant without
heavy lifting
Scales elastically
Increased agility
Secure and visible
Plug and play
15. Amazon Kinesis Streams
Build your own data streaming applications
Easy administration: Simply create a new stream, and set the desired level of capacity with
shards. Scale to match your data throughput rate and volume.
Build real-time applications: Perform continual processing on streaming big data using
Kinesis Client Library (KCL), Kinesis Analytics, Apache Spark/Storm, AWS Lambda, and more.
Low cost: Cost-efficient for workloads of any scale.
16. • Streams are made of Shards
• Each Shard ingests data up to
1MB/sec, and up to 1000 TPS
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours – 7 days
• Scale Kinesis streams by splitting or
merging Shards
• Replay data inside of 24Hr -7days
Window
Amazon Kinesis Stream
Managed Ability to capture and store Data
Time-based
seek
17. Amazon Kinesis Streams pricing
Simple, pay-as-you-go, and no up-front costs
Dimension Value
Shard Hour $0.015
PUT Payload (per 1,0000,000 PUTs) $0.014
Extended Data Retention (Up to 7 days),
per Shard Hour
$0.020
18. Streaming Data Scenarios Across Verticals
Scenarios/
Verticals
Accelerated Ingest-
Transform-Load
Continuous Metrics
Generation
Machine Learning and
Actionable Insights
Digital Ad
Tech/Marketing
Publisher, bidder data
aggregation
Advertising metrics like
coverage, yield, and
conversion
User engagement with
ads, optimized bid/buy
engines
IoT Sensor, device telemetry
data ingestion
Operational metrics and
dashboards
Device operational
intelligence and alerts
Gaming Online data aggregation,
e.g., top 10 players
Massively multiplayer
online game (MMOG) live
dashboard
Leader board generation,
player-skill match
Consumer
Online
Clickstream analytics Metrics like impressions
and page views
Recommendation engines,
proactive care
Operation
Security
DevOps tools, Ingesting
VPCFlowLogs
Subscribe to
CloudwatchLogs and
analyze logs in Real-Time
Anomaly Detection
19. Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Redshift
and Elasticsearch
Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other
destinations without writing an application or managing infrastructure.
Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data
destinations in as little as 60 secs using simple configurations.
Elastic: Scales to match data throughput w/o intervention
Serverless ETL using AWS Lambda - Firehose can invoke your Lambda function to transform incoming
source data.
Capture and submit
streaming data
Analyze streaming data using
your favorite BI tools
Firehose loads streaming data
continuously into Amazon S3, Redshift
and Elasticsearch
20. Amazon Kinesis Firehose pricing
Simple, pay-as-you-go, and no up-front costs
Dimension Value
Per 1 GB of data ingested $0.029 - *$0.020
*Tier pricing
22. Kinesis Firehose vs. Kinesis Streams
Amazon Kinesis Streams is for use cases that require custom
processing, per incoming record, with sub-1 second processing
latency, and a choice of stream processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero
administration, ability to use existing analytics tools based on
Amazon S3, Amazon Redshift and Amazon Elasticsearch, and a
data latency of 60 seconds or higher.
23. Streaming Data Scenarios Across Verticals
Scenarios/
Verticals
Accelerated Ingest-
Transform-Load
Continuous Metrics
Generation
Machine Learning and
Actionable Insights
Digital Ad
Tech/Marketing
Publisher, bidder data
aggregation
Advertising metrics like
coverage, yield, and
conversion
User engagement with
ads, optimized bid/buy
engines
IoT Sensor, device telemetry
data ingestion
Operational metrics and
dashboards
Device operational
intelligence and alerts
Gaming Online data aggregation,
e.g., top 10 players
Massively multiplayer
online game (MMOG) live
dashboard
Leader board generation,
player-skill match
Consumer
Online
Clickstream analytics Metrics like impressions
and page views
Recommendation engines,
proactive care
Operation
Security
DevOps tools, Ingesting
VPCFlowLogs
Subscribe to
CloudwatchLogs and
analyze logs in Real-Time
Anomaly Detection
24. Amazon Kinesis Analytics
Apply SQL on streams: Easily connect to a Kinesis Stream or Firehose Delivery
Stream and apply SQL skills.
Build real-time applications: Perform continual processing on streaming big data
with sub-second processing latencies.
Easy Scalability : Elastically scales to match data throughput.
Connect to Kinesis streams,
Firehose delivery streams
Run standard SQL queries
against data streams
Kinesis Analytics can send processed data
to analytics tools so you can create alerts
and respond in real-time
25. Use SQL to build real-time applications
Easily write SQL code to process
streaming data
Connect to streaming source
Continuously deliver SQL results
31. Capture all of the value from your data with Amazon Kinesis
Amazon
S3
Amazon
Kinesis
Analytics
Amazon Kinesis–
enabled app
Amazon Kinesis
Streams
Ingest Process React Persist
AWS
Lambda
0ms 200ms 1-2 s
Amazon
QuickSight
Amazon Kinesis
Firehose
Amazon
Redshift
33. Amazon Kinesis Data Generator on GitHub
https://awslabs.github.io/amazon-kinesis-data-generator/
34. Putting Data in Amazon Kinesis
AWS SDK
• PutRecord()
• PutRecordBatch()
Kinesis Agent
• Monitors files and sends new data records to your delivery stream
• Handles file rotation, checkpointing, and retry upon failures
• Pre-processing capabilities such as format conversion and log parsing
• Emits AWS CloudWatch metrics for monitoring and troubleshooting
39. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.coNutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Daniel Modlinger
VP R&D
AWS Summit Tel Aviv
40. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Nutrition advice - conflicted
and not personal
$435B
41. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Food Analysis System
Scientific Literature
Food
(10,000+ papers)
(1,000,000+ foods)
(300,000+ restaurant locations)
Menus
FoodPrint™
Correlations
Insights
Predictions
Personal & continually
improving nutrition recommendations
Activity
Sleep
Continuous Glucose
Blood Tests
DNA - Genome
Taste, allergies etc.
Mood
Blood pressure
Temperature
Biome
Social media
Individual Inputs
42. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Examples - Partners
Pregnancy
Diabetes
ConsumerEmployer/payer
Retail
confidential
Hypertension
confidential
43. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.coNutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Data Streaming
44. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
FoodPrint™
Activity
Sleep
Continuous Glucose
Blood Tests
DNA - Genome
Taste, allergies etc.
Mood
Blood pressure
Temperature
Biome
Social media
Individual Inputs
45. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Needs
Volume - Millions of daily data points
Scale - Non linear growth
Performance - Near real time, as low as 3rd party producers of data
allow
Extract as much data as possible and store unknown data types and
formats
Process and store a wide variety of known data types
Goal: Process and store nutrition-related sensor data
46. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
First attempts
2015 2016 2017
47. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Updated Needs (AKA lessons learned)
Cut the middle man
Detailed monitoring is a necessity
Be event driven
Fast developer response to changes in APIs
Reliability is key, otherwise data gets lost
Goal: Process and store nutrition related sensors data
48. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Current Solution - High Level Architecture
49. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Lambda allows us to write simple and clean code that will
deal with each entity without the need to be aware of flow
or context
Why Kinesis & Lambda (1)
Process and store sensor data and all other nutrition-related data
Scale - Non linear growth
Performance - Near real time
Store unknown data types and formats
Wide variety of known data types
All unknow data is saved to S3 using Kinesis Firehose,
which is later analyzed by our Spark cluster
With Lambda triggers on streams we can act
quickly enough
Scaling can be as easy as adding a shard, Lambda
functions are executed as much as we need
50. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Volume - Millions of daily data points
S3 is limitless, Kinesis streams are practically limitless
Why Kinesis & Lambda (2)
Process and store sensor data and all other nutrition-related data
Reliability is key, otherwise data gets lost
Allow fast developer response to changes
in APIs
Be event driven wherever possible
because polling causes problems
Kinesis streams are persistent (from 24h up to one
week), Lambda functions automatically retry,
cloudwatch has detailed monitoring of everything
Using API Gateway and Lambda integration allows 3rd
party producers to activate the processing
Changing one Lambda function does not affect the flow,
the deploy resolution can be as low as a single function
51. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Why Kinesis & Lambda (3)
Detailed monitoring is a necessity and
not a luxury
Cut the middle man - avoid writing
infrastructure to allow clients (iOS)
write sensor data
Iterator Age, put count, Lambda monitoring are a good
baseline to start from
We are using AWS iOS SDK to write the sensor data
Process and store sensor data and all other nutrition-related data
52. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
• Optimal shards number is a derivative of desired iterator age (aka latency)
• Optimal batch size - If your Lambda scales in a linear fashion then it’s doesn’t
really matters..
• A high number will cause less calls and can create an internal queue for the
consumer Lambda function (which will increase the latency to the user)
• A low number can be expensive in terms of Lambda overhead and price.
Scale & Performance (1)
Best Practice
53. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Scale & Performance (2)
Best Practice
• Lambda functions run sequentially on the shard’s data, but in parallel between
different shard
• The Lambda consumer will not continue to the next batch until the function
finishes successfully. Make sure to handle exception in your Lambda functions
or you can get your shard stuck!
• If your functions are slow, consider using a “forward” Lambda: a function that
will only read the records from Kinesis and then will activate another Lambda
function (which can run in parallel)
• Understand the KCL, it will help you adjust the different parameters to your
needs
54. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Best Practice
• Make sure to monitor:
• Iterator age
• Put records count
• Records distribution across shards
• Each Lambda on it’s own
• Your own required metrics
Monitoring
55. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
• Duplicate records are almost inevitable
• It’s easier to just handle duplicates than to avoid the problem
• Watch out - updates are not duplicates!
• It is possible to keep records order, but will likely to cause a larger overhead. Try
to avoid relying on records order as much as possible.
Best Practice
Duplication handling & order considerations
56. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Best Practice
• It’s recommended that the first setup will be done manually (using CLI)
• We found Serverless (by Serverless, Inc.) to be the most suitable for Kinesis
with Lambda usage (for Web APIs that might not be the case)
Frameworks
57. Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.coNutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Thank you!