Get answers to technical questions frequently asked by those starting to work with streaming data. Learn best practices for building a real-time streaming data architecture on AWS with Amazon Kinesis, Spark Streaming, AWS Lambda, and Amazon EMR. First, we will focus on building a scalable, durable streaming data ingestion workflow from data producers like mobile devices, servers, or even web browsers. We will provide guidelines to minimize duplicates and achieve exactly-once processing semantics in your stream-processing applications. Then, we will show some of the proven architectures for processing streaming data using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming running on Amazon EMR.
2. Learning objectives
• Introduction to streaming data
• Use cases and design patterns
• Amazon Kinesis Streams data ingestion: top 6 things to know
• Stream processing overview:
- Understanding your options
- Amazon Kinesis Client Library (KCL) overview
- Spark Streaming on Amazon EMR: overview and best practices
- Amazon Kinesis Analytics (new)
• Q & A
3. Batch Processing vs. Stream Processing

| Batch Processing | Stream Processing |
| --- | --- |
| Hourly server logs | Real-time metrics |
| Weekly or monthly bills | Real-time spending alerts/caps |
| Daily website clickstream | Real-time clickstream analysis |
| Daily fraud reports | Real-time fraud detection |

It's all about the pace
4. Metering record:

{
  "payerId": "Joe",
  "productCode": "AmazonS3",
  "clientProductCode": "AmazonS3",
  "usageType": "Bandwidth",
  "operation": "PUT",
  "value": "22490",
  "timestamp": "1216674828"
}

Common log entry:

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

MQTT record:

"SeattlePublicWater/Kinesis/123/Realtime" – 412309129140

Syslog entry:

<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"]
Streaming data challenges: variety & velocity
• Streaming data comes in different types and formats
− Metering records, logs, and sensor data
− JSON, CSV, TSV
• Can vary in size from a few bytes to kilobytes or megabytes
• High velocity and continuous
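Because records arrive in varied formats, a stream consumer typically normalizes each format on read. A minimal sketch, assuming the metering record above arrives as a raw JSON payload (field handling is ours, not a fixed schema):

```python
import json

# The metering record from the slide, as it might arrive as a raw stream payload.
raw = (b'{"payerId": "Joe", "productCode": "AmazonS3", '
       b'"usageType": "Bandwidth", "operation": "PUT", '
       b'"value": "22490", "timestamp": "1216674828"}')

record = json.loads(raw)
# JSON payloads often carry numbers as strings; the consumer coerces them.
bytes_used = int(record["value"])
event_time = int(record["timestamp"])  # Unix epoch seconds
```

Log, MQTT, and syslog records would each get their own parser; the common pattern is decode, validate, and coerce types before any aggregation.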
5. Streaming data scenarios across verticals

| Scenarios / Verticals | Accelerated Ingest-Transform-Load | Continuous Metrics Generation | Responsive Data Analysis |
| --- | --- | --- | --- |
| Digital Ad Tech/Marketing | Publisher, bidder data aggregation | Advertising metrics like coverage, yield, and conversion | User engagement with ads, optimized bid/buy engines |
| IoT | Sensor, device telemetry data ingestion | Operational metrics and dashboards | Device operational intelligence and alerts |
| Gaming | Online data aggregation, e.g., top 10 players | Massively multiplayer online game (MMOG) live dashboard | Leaderboard generation, player-skill match |
| Consumer Online | Clickstream analytics | Metrics like impressions and page views | Recommendation engines, proactive care |
6. Customer use cases
• Sonos runs near real-time streaming analytics on device data logs from its connected hi-fi audio equipment
• Hearst analyzes 30 TB+ of clickstream data, enabling real-time insights for publishers
• Glu Mobile collects billions of gaming event data points from millions of user devices in real time every single day
• Nordstrom's recommendation team built an online stylist using Amazon Kinesis Streams and AWS Lambda
8. Amazon Kinesis: streaming data made easy
Services make it easy to capture, deliver, and process streams on AWS
• Amazon Kinesis Streams (for technical developers): build custom applications to process or analyze streaming data
• Amazon Kinesis Firehose (for all developers, data scientists, and IT professionals): transform and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and more
• Amazon Kinesis Analytics (for all developers, analysts, and data scientists): easily analyze streaming data using standard SQL
9. Amazon Kinesis Firehose
Load streaming data into Amazon S3, Amazon Redshift, and Amazon ES
Zero administration: Deliver streaming data into Amazon S3, Amazon Redshift, Amazon ES,
and other destinations without writing an application or managing infrastructure.
Direct-to-data store integration: Batch, compress, and encrypt streaming data for
delivery into data destinations in as little as 60 seconds using simple configurations.
Seamless elasticity: Seamlessly scale to match data throughput without intervention.
Flow: capture and submit streaming data → Firehose loads it continuously into Amazon S3, Amazon Redshift, and Amazon ES → analyze it using your favorite BI tools
12. What stream storage should I use?

| | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon DynamoDB Streams |
| --- | --- | --- | --- | --- |
| AWS managed service | Yes | Yes | No | Yes |
| Ordering | Yes | Yes | Yes | Yes |
| Data retention period | 7 days | N/A | Manual/configurable | 24 hours |
| Availability | 3 AZ | 3 AZ | Manual/configurable | 3 AZ |
| Scale / throughput | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limit / ~ table IOPS |
| Parallel clients | Yes | No | Yes | Yes |
| Stream MapReduce | Yes | N/A | Yes | Yes |
| Record/object size | 1 MB | 1 MB | Configurable | 400 KB |
| Cost | Low | Low | Low (+admin) | Higher (table cost) |
13. Amazon Kinesis Streams
Managed ability to capture and store data
• Streams are made of shards
• Each shard ingests up to 1 MB/sec and 1,000 records/sec
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours by default; storage can be extended for up to 7 days
• Scale Amazon Kinesis Streams using the scaling utility
• Replay data inside of the 24-hour window
14. Topic #1: framing your data ingestion pattern
Workload determines partition key strategy

Managed buffer
• Care about a reliable, scalable way to capture data
• Defer all other aggregation to the consumer
• Generate random partition keys
• Ensure high cardinality of partition keys relative to the number of shards, to spray records evenly across available shards

Streaming MapReduce
• Leverage partition keys as a natural way to aggregate data, e.g., a partition key per billing customer, per device ID, or per stock symbol
• Design partition keys to scale
• Be aware of "hot" partition keys or shards
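Partition key cardinality matters because Kinesis routes each record by taking the MD5 hash of its partition key as a 128-bit integer and matching it against a shard's hash-key range. A sketch of that routing for a stream whose shards split the hash space evenly (the helper name is ours):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Return the shard index a partition key lands on, assuming the
    stream's shards split the 128-bit MD5 hash-key space evenly."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * num_shards // 2 ** 128

# A single low-cardinality key makes every record land on one "hot" shard;
# per-device keys spread records across the available shards.
hot = {shard_for_key("billing", 4)}
spread = {shard_for_key(f"device-{i}", 4) for i in range(1000)}
```

This is why random keys suit the managed-buffer pattern (even spray), while entity-derived keys suit streaming MapReduce (all records for one entity land on the same shard, in order).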
15. Topic #2: putting data into Amazon Kinesis
Options range from AWS SDKs to agents
• Amazon Kinesis Agent (supports preprocessing):
http://docs.aws.amazon.com/firehose/latest/dev/writing-with-agents.html
• Amazon Kinesis Producer Library (KPL)
• Use open-source log data aggregators: consider Flume and Fluentd as collectors/agents; see https://github.com/awslabs/aws-fluent-plugin-kinesis
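When writing with the plain AWS SDK instead of the agent or KPL, producers typically batch writes with the PutRecords API, which accepts up to 500 records per call. A sketch of the batching step (the stream name and helper are hypothetical; the actual boto3 call is shown only as a comment so the snippet stays self-contained):

```python
def batch_records(records, max_batch=500):
    """Chunk (data, partition_key) pairs into PutRecords-sized requests."""
    for i in range(0, len(records), max_batch):
        chunk = records[i:i + max_batch]
        yield [{"Data": data, "PartitionKey": key} for data, key in chunk]

# 1,200 sample payloads keyed by device, spread over 10 partition keys.
entries = [(b"payload-%d" % i, "device-%d" % (i % 10)) for i in range(1200)]
batches = list(batch_records(entries))
# Each batch is ready for:
#   boto3.client("kinesis").put_records(StreamName="my-stream", Records=batch)
```

PutRecords responses can report partial failures per record, so a production producer would also inspect `FailedRecordCount` and retry the failed entries.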
16. Topic #3: sizing the Amazon Kinesis stream
Provision adequate shards; you can always change the count later
• For ingress needs, capture all incoming data:
− Count likely data producers: log servers, sensors/things, smartphone app installs
− Consider individual payload size and (desired) frequency of puts
• For egress needs, feed all consuming applications:
− Each shard can do 2 MB/sec on egress
− Add more shards for more applications
• Include headroom to "catch up" with data in the stream in the event of application failures
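The sizing rules above reduce to simple arithmetic: a shard takes 1 MB/sec or 1,000 records/sec on ingress and serves 2 MB/sec on egress, so the shard count is the largest of the three demands, padded with headroom. A sketch (the function and headroom factor are ours):

```python
import math

def shards_needed(ingress_mb_per_sec, records_per_sec,
                  egress_mb_per_sec, headroom=1.25):
    """Smallest shard count covering ingress rate, record rate, and egress,
    with headroom (here 25%) for catch-up after consumer failures."""
    need = max(ingress_mb_per_sec / 1.0,   # 1 MB/sec ingress per shard
               records_per_sec / 1000.0,   # 1,000 records/sec per shard
               egress_mb_per_sec / 2.0)    # 2 MB/sec egress per shard
    return math.ceil(need * headroom)

# e.g., 4 MB/sec in, 6,000 records/sec, two consumers reading 4 MB/sec each
n = shards_needed(4, 6000, 8)
```

Here the record rate (6,000/sec → 6 shards) dominates the byte rates, and the 25% headroom rounds the answer up to 8 shards.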
18. Topic #5: enabling enhanced monitoring
Get per-shard metrics for Amazon Kinesis Streams
• IncomingBytes
• IncomingRecords
• OutgoingBytes
• OutgoingRecords
• WriteProvisionedThroughputExceeded
• ReadProvisionedThroughputExceeded
• IteratorAgeMilliseconds
• ALL
19. Topic #6: when to use Firehose or Amazon Kinesis Streams
Depends on your use case and the nature of the data
Amazon Kinesis Streams is for use cases that require custom processing per incoming record, sub-second processing latency, and a choice of stream-processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero administration, the ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon ES, and a data latency of 60 seconds or higher.
21. What stream processing technology should I use?

| | KCL Application | Spark Streaming | AWS Lambda | Amazon Kinesis Analytics | Apache Storm |
| --- | --- | --- | --- | --- | --- |
| Scale | ~ nodes | ~ nodes | Automatic | Automatic | ~ nodes |
| Micro-batch or real-time | Near real-time | Micro-batch | Near real-time | Near real-time | Real-time |
| AWS managed service | No (KCL + EC2 + Auto Scaling) | Yes (via EMR) | Yes | Yes | No (EC2) |
| Scalability | No limits / ~ nodes | No limits / ~ nodes | No limits | No limits | No limits / ~ nodes |
| Availability | Multi-AZ | Single AZ | Multi-AZ | Multi-AZ | Configurable |
| Programming languages | Java, via MultiLangDaemon (.NET, Python, Ruby, Node.js) | Java, Python, Scala | Node.js, Java, Python | SQL | Any language via Thrift |
23. Amazon Kinesis Client Library (KCL)
Open-source library to build your own apps
• Build Amazon Kinesis applications with the KCL
• Available for Java, Ruby, Python, and Node.js developers
• Deploy on your EC2 instances
• A KCL application includes three components:
1. Worker: processing unit that maps to each application EC2 instance
2. Record processor factory: creates the record processor
3. Record processor: processing unit that processes data from a shard in Amazon Kinesis Streams
24. State management with the KCL
Enables scalable, at-least-once processing
• One record processor maps to one shard and processes data records from that shard
• One worker maps to one or more record processors
• Balances shard-worker associations when worker/instance counts change
• Balances shard-worker associations when shards split or merge
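That balancing behavior can be pictured with a tiny simulation: shard leases are spread as evenly as possible across workers and re-spread when workers come or go. This is a simplified sketch, not the actual KCL lease algorithm (which coordinates through a DynamoDB lease table):

```python
def assign_leases(shards, workers):
    """Round-robin shard leases across workers, as evenly as possible."""
    assignment = {w: [] for w in workers}
    for i, shard in enumerate(sorted(shards)):
        assignment[workers[i % len(workers)]].append(shard)
    return assignment

shards = [f"shardId-{i:04d}" for i in range(8)]
# Two workers each hold 4 leases; adding a third rebalances to 3/3/2.
before = assign_leases(shards, ["worker-1", "worker-2"])
after = assign_leases(shards, ["worker-1", "worker-2", "worker-3"])
```

Because a rebalance can hand a shard to a new worker mid-stream, record processors must checkpoint; a restart from the last checkpoint is what makes the processing at-least-once rather than exactly-once.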
26. Apache Spark and Amazon Kinesis Streams
• Apache Spark is an in-memory analytics engine that uses resilient distributed datasets (RDDs) for fast processing
• Spark Streaming can read directly from an Amazon Kinesis stream
• Amazon Software License (ASL) linking: add the ASL dependency to your SBT/Maven project, artifactId = spark-streaming-kinesis-asl_2.10

Example: counting tweets on a sliding window (simplified)

KinesisUtils.createStream("twitter-stream")
  .filter(_.getText.contains("Open-Source"))
  .countByWindow(Seconds(5))
27. Using Spark Streaming with Amazon Kinesis Streams
Key configurations to bear in mind for Spark 1.6+ on Amazon EMR
• Leverage the EMRFS consistent-view option if you use Amazon S3 as storage for Spark checkpoints
• DynamoDB table name: make sure there is only one instance of the application running with Spark Streaming
• Enable Spark-based checkpoints
• Make the number of Amazon Kinesis receivers a multiple of the number of executors so they are load-balanced
• Keep total processing time less than the batch interval
• Keep the number of executors the same as the number of cores per executor
• Spark Streaming uses a default of 1 sec with the KCL
30. Amazon Kinesis Analytics
Analyze data streams continuously with standard SQL
Apply SQL on streams: easily connect to a stream or Firehose and apply SQL
Build real-time applications: continuous processing on streaming big data
Fully managed: no servers to manage; elastically scalable
Flow: connect to Amazon Kinesis streams or Firehose delivery streams → run standard SQL queries against the data → send processed data to analytics tools to create alerts and respond in real time
32. Conclusion
• Amazon Kinesis offers a managed service for building applications, streaming data ingestion, and continuous processing
• Stream ingest: set up the partition key model up front, determine producer-side software, and leverage detailed metrics
• Stream processing: pick the right tool for the job
• Try out the Amazon Kinesis platform: http://aws.amazon.com/kinesis/
33. Reference
• Technical documentation
− Amazon Kinesis Agent
− Amazon Kinesis Streams and Spark Streaming
− Amazon Kinesis Producer Library Best Practices
− Amazon Kinesis Firehose and AWS Lambda
− Building a Near Real-Time Discovery Platform with Amazon Kinesis
• Public case studies
− Glu Mobile – Real-Time Analytics
− Hearst Publishing – Clickstream Analytics
− How Sonos Leverages Amazon Kinesis
− Nordstrom Online Stylist