Analyzing Real-time Streaming Data with Amazon Kinesis

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Roy Ben-Alta, Amazon AI and Analytics, BDM
Daniel Modlinger, VP R&D, Nutrino
June 21st, 2017
Real-Time Streaming Analytics
Overview Of Amazon Kinesis

What to expect from this session
• Real-Time Streaming Analytics – Why Now?
• Amazon Kinesis Overview
• New Features
• Design Patterns
• Nutrino Case Study

Most data is produced continuously
Mobile Apps Web Clickstream Application Logs
Metering Records IoT Sensors Smart Buildings
[Wed Oct 11 14:32:52
2000] [error] [client
127.0.0.1] client
denied by server
configuration:
/export/home/live/ap/h
tdocs/test

What is your decision latency?
OR

The diminishing value of data
Recent data is highly valuable
• If you act on it in time
• Perishable Insights (M. Gualtieri, Forrester)
Old + Recent data is more valuable
• If you have the means to combine them

Processing real-time, streaming data
• Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable
What are the key requirements?
Ingest Transform Analyze React Persist

Streaming Data Scenarios Across Verticals
Scenarios/
Verticals
Accelerated Ingest-
Transform-Load
Continuous Metrics
Generation
Machine Learning and
Actionable Insights
Digital Ad
Tech/Marketing
Publisher, bidder data
aggregation
Advertising metrics like
coverage, yield, and
conversion
User engagement with
ads, optimized bid/buy
engines
IoT Sensor, device telemetry
data ingestion
Operational metrics and
dashboards
Device operational
intelligence and alerts
Gaming Online data aggregation,
e.g., top 10 players
Massively multiplayer
online game (MMOG) live
dashboard
Leader board generation,
player-skill match
Consumer
Online
Clickstream analytics Metrics like impressions
and page views
Recommendation engines,
proactive care
Operation
monitoring
Security
DevOps tools,
Ingesting
VPCFlowLogs
Subscribe to
CloudwatchLogs and
analyze logs in Real-
Time
Anomaly Detection

Amazon Kinesis
Kinesis FirehoseKinesis Analytics
Platform for streaming data on AWS
Kinesis Streams

Amazon Kinesis
Kinesis FirehoseKinesis AnalyticsKinesis Streams
DeliverProcessCapture

Amazon Kinesis
DeliverProcess
For Technical Developers
Build your own custom
applications that process
or analyze streaming data

Amazon Kinesis
Deliver
For All Developers, Data
Scientists, Business Analysts
Easily analyze data streams
using standard SQL queries

Amazon Kinesis
For All Developers, Data
Scientists, Business Analysts
Easily analyze data streams
using standard SQL queries
For All Developers
Easily load massive volumes of
streaming data into S3,
Redshift and Elasticsearch,
clean and format using Lambda

Amazon Kinesis Customer Base Diversity
1 billion events/wk from
connected devices | IoT
17 PB of game data per
season | Entertainment
80 billion ad
impressions/day, 30 ms
response time | Ad Tech
100 GB/day click streams
from 250+ sites |
Enterprise
50 billion ad
impressions/day sub-50
ms responses | Ad Tech
10 million events/day
| Retail
Ingesting 2M+ network
events every second
Funnel all
production events
through Amazon
Kinesis

Why are these customers choosing Amazon Kinesis?
Lower costs
Performant without
heavy lifting
Scales elastically
Increased agility
Secure and visible
Plug and play

Amazon Kinesis Streams
Build your own data streaming applications
Easy administration: Simply create a new stream, and set the desired level of capacity with
shards. Scale to match your data throughput rate and volume.
Build real-time applications: Perform continual processing on streaming big data using
Kinesis Client Library (KCL), Kinesis Analytics, Apache Spark/Storm, AWS Lambda, and more.
Low cost: Cost-efficient for workloads of any scale.

• Streams are made of Shards
• Each Shard ingests data up to
1MB/sec, and up to 1000 TPS
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours – 7 days
• Scale Kinesis streams by splitting or
merging Shards
• Replay data inside of 24Hr -7days
Window
Amazon Kinesis Stream
Managed Ability to capture and store Data
Time-based
seek

Amazon Kinesis Streams pricing
Simple, pay-as-you-go, and no up-front costs
Dimension Value
Shard Hour $0.015
PUT Payload (per 1,0000,000 PUTs) $0.014
Extended Data Retention (Up to 7 days),
per Shard Hour
$0.020

Streaming Data Scenarios Across Verticals
Scenarios/
Verticals
Accelerated Ingest-
Transform-Load
Continuous Metrics
Generation
Machine Learning and
Actionable Insights
Digital Ad
Tech/Marketing
Publisher, bidder data
aggregation
Advertising metrics like
coverage, yield, and
conversion
User engagement with
ads, optimized bid/buy
engines
IoT Sensor, device telemetry
data ingestion
Operational metrics and
dashboards
Device operational
intelligence and alerts
Gaming Online data aggregation,
e.g., top 10 players
Massively multiplayer
online game (MMOG) live
dashboard
Leader board generation,
player-skill match
Consumer
Online
Clickstream analytics Metrics like impressions
and page views
Recommendation engines,
proactive care
Operation
Security
DevOps tools, Ingesting
VPCFlowLogs
Subscribe to
CloudwatchLogs and
analyze logs in Real-Time
Anomaly Detection

Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Redshift
and Elasticsearch
Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other
destinations without writing an application or managing infrastructure.
Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data
destinations in as little as 60 secs using simple configurations.
Elastic: Scales to match data throughput w/o intervention
Serverless ETL using AWS Lambda - Firehose can invoke your Lambda function to transform incoming
source data.
Capture and submit
streaming data
Analyze streaming data using
your favorite BI tools
Firehose loads streaming data
continuously into Amazon S3, Redshift
and Elasticsearch

Amazon Kinesis Firehose pricing
Simple, pay-as-you-go, and no up-front costs
Dimension Value
Per 1 GB of data ingested $0.029 - *$0.020
*Tier pricing

Amazon Kinesis Firehose or Amazon Kinesis Streams?

Kinesis Firehose vs. Kinesis Streams
Amazon Kinesis Streams is for use cases that require custom
processing, per incoming record, with sub-1 second processing
latency, and a choice of stream processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero
administration, ability to use existing analytics tools based on
Amazon S3, Amazon Redshift and Amazon Elasticsearch, and a
data latency of 60 seconds or higher.

Amazon Kinesis Analytics
Apply SQL on streams: Easily connect to a Kinesis Stream or Firehose Delivery
Stream and apply SQL skills.
Build real-time applications: Perform continual processing on streaming big data
with sub-second processing latencies.
Easy Scalability : Elastically scales to match data throughput.
Connect to Kinesis streams,
Firehose delivery streams
Run standard SQL queries
against data streams
Kinesis Analytics can send processed data
to analytics tools so you can create alerts
and respond in real-time

Use SQL to build real-time applications
Easily write SQL code to process
streaming data
Connect to streaming source
Continuously deliver SQL results

Amazon Kinesis Streams
Forthcoming Features
Encryption at
Rest
VPCe

Amazon Kinesis Firehose
VPCe:
RedShift
ElasticSearch
Parquet Format AWS RegionsKafka Connector

Amazon Kinesis Analytics
Amazon
Lambda
Integration
Amazon
QuickSight
Destination
Improved
Autoscaling

Capture all of the value from your data with Amazon Kinesis
Amazon
S3
Amazon
Kinesis
Analytics
Amazon Kinesis–
enabled app
Amazon Kinesis
Streams
Ingest Process React Persist
AWS
Lambda
0ms 200ms 1-2 s
Amazon
QuickSight
Amazon Kinesis
Firehose
Amazon
Redshift

Serverless Analytics – Enable your org with 0 operation overhead
Amazon S3
Ingest
Streaming
ETL
Persist Analyze
AWS
Lambda
O msec seconds < 5 minutes
Amazon Kinesis
Firehose
Amazon
Redshift
Amazon
Elasticsearch
Amazon
Athena
Amazon
Kinesis
Analytics
Amazon
Redshift
Spectrum
Amazon
Kinesis
Streams

Amazon Kinesis Data Generator on GitHub
https://awslabs.github.io/amazon-kinesis-data-generator/

Putting Data in Amazon Kinesis
AWS SDK
• PutRecord()
• PutRecordBatch()
Kinesis Agent
• Monitors files and sends new data records to your delivery stream
• Handles file rotation, checkpointing, and retry upon failures
• Pre-processing capabilities such as format conversion and log parsing
• Emits AWS CloudWatch metrics for monitoring and troubleshooting

Amazon EMR
Amazon
Kinesis
StreamsStreaming Input
Tumbling/Fixed
Window
Aggregation
Periodic Output
Amazon Redshift
COPY from
Amazon EMR
Common Integration Pattern with Amazon EMR
Tumbling Window Reporting

Amazon Kinesis Streams with AWS Lambda

Build your Streaming Pipeline with Amazon
Kinesis, AWS Lambda and Amazon EMR

Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.coNutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Daniel Modlinger
VP R&D
AWS Summit Tel Aviv

Nutrino proprietary and confidential All rights reserved info@nutrino.co www.nutrino.co
Nutrition advice - conflicted
and not personal
$435B

Food Analysis System
Scientific Literature
Food
(10,000+ papers)
(1,000,000+ foods)
(300,000+ restaurant locations)
Menus
FoodPrint™
Correlations
Insights
Predictions
Personal & continually
improving nutrition recommendations
Activity
Sleep
Continuous Glucose
Blood Tests
DNA - Genome
Taste, allergies etc.
Mood
Blood pressure
Temperature
Biome
Social media
Individual Inputs

Examples - Partners
Pregnancy
Diabetes
ConsumerEmployer/payer
Retail
confidential
Hypertension
confidential

Data Streaming

FoodPrint™
Activity
Sleep
Continuous Glucose
Blood Tests
DNA - Genome
Taste, allergies etc.
Mood
Blood pressure
Temperature
Biome
Social media
Individual Inputs

Needs
Volume - Millions of daily data points
Scale - Non linear growth
Performance - Near real time, as low as 3rd party producers of data
allow
Extract as much data as possible and store unknown data types and
formats
Process and store a wide variety of known data types
Goal: Process and store nutrition-related sensor data

First attempts
2015 2016 2017

Updated Needs (AKA lessons learned)
Cut the middle man
Detailed monitoring is a necessity
Be event driven
Fast developer response to changes in APIs
Reliability is key, otherwise data gets lost
Goal: Process and store nutrition related sensors data

Current Solution - High Level Architecture

Lambda allows us to write simple and clean code that will
deal with each entity without the need to be aware of flow
or context
Why Kinesis & Lambda (1)
Process and store sensor data and all other nutrition-related data
Scale - Non linear growth
Performance - Near real time
Store unknown data types and formats
Wide variety of known data types
All unknow data is saved to S3 using Kinesis Firehose,
which is later analyzed by our Spark cluster
With Lambda triggers on streams we can act
quickly enough
Scaling can be as easy as adding a shard, Lambda
functions are executed as much as we need

Volume - Millions of daily data points
S3 is limitless, Kinesis streams are practically limitless
Reliability is key, otherwise data gets lost
Allow fast developer response to changes
in APIs
Be event driven wherever possible
because polling causes problems
Kinesis streams are persistent (from 24h up to one
week), Lambda functions automatically retry,
cloudwatch has detailed monitoring of everything
Using API Gateway and Lambda integration allows 3rd
party producers to activate the processing
Changing one Lambda function does not affect the flow,
the deploy resolution can be as low as a single function

Detailed monitoring is a necessity and
not a luxury
Cut the middle man - avoid writing
infrastructure to allow clients (iOS)
write sensor data
Iterator Age, put count, Lambda monitoring are a good
baseline to start from
We are using AWS iOS SDK to write the sensor data

• Optimal shards number is a derivative of desired iterator age (aka latency)
• Optimal batch size - If your Lambda scales in a linear fashion then it’s doesn’t
really matters..
• A high number will cause less calls and can create an internal queue for the
consumer Lambda function (which will increase the latency to the user)
• A low number can be expensive in terms of Lambda overhead and price.
Scale & Performance (1)
Best Practice

Scale & Performance (2)
Best Practice
• Lambda functions run sequentially on the shard’s data, but in parallel between
different shard
• The Lambda consumer will not continue to the next batch until the function
finishes successfully. Make sure to handle exception in your Lambda functions
or you can get your shard stuck!
• If your functions are slow, consider using a “forward” Lambda: a function that
will only read the records from Kinesis and then will activate another Lambda
function (which can run in parallel)
• Understand the KCL, it will help you adjust the different parameters to your
needs

Best Practice
• Make sure to monitor:
• Iterator age
• Put records count
• Records distribution across shards
• Each Lambda on it’s own
• Your own required metrics
Monitoring

• Duplicate records are almost inevitable
• It’s easier to just handle duplicates than to avoid the problem
• Watch out - updates are not duplicates!
• It is possible to keep records order, but will likely to cause a larger overhead. Try
to avoid relying on records order as much as possible.
Best Practice
Duplication handling & order considerations

Best Practice
• It’s recommended that the first setup will be done manually (using CLI)
• We found Serverless (by Serverless, Inc.) to be the most suitable for Kinesis
with Lambda usage (for Web APIs that might not be the case)
Frameworks

Thank you!

Analyzing Real-time Streaming Data with Amazon Kinesis

Analyzing Real-time Streaming Data with Amazon Kinesis

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Analyzing Real-time Streaming Data with Amazon Kinesis

Similar to Analyzing Real-time Streaming Data with Amazon Kinesis (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Analyzing Real-time Streaming Data with Amazon Kinesis