Real-time Analytics on AWS
Sungmin Kim
Solutions Architect, AWS
Agenda
• Why Real-time Data Streaming and Analytics?
• How to Build?
• Where to Store Streaming Data?
• How to Ingest Streaming Data?
• How to Process Streaming Data?
• Deliver Streaming Data
• Dive into Stream Process Services
• Transform, Aggregate, Join Streaming Data
• Case Studies
• Key Takeaways
Why Real-time Data Streaming and Analytics?
Data
“The world’s most valuable resource is no longer oil, but data.”
* Copyright: David Parkins, The Economist, 2017
Data Loses Value Over Time
[Chart: the value of data to decision-making declines over time, from real time (preventive/predictive) through seconds/minutes (actionable) and hours/days (reactive) to months (historical). Time-critical decisions live on the left of the curve; traditional “batch” business intelligence lives on the right.]
* Source: Mike Gualtieri, Forrester, “Perishable Insights”
To create value, derive insights in real time
Batch vs Real-time

Difference | Batch | Real-time
Continuity | Arbitrary or periodic | Constant
Method of analysis | Store → Process (Hadoop MapReduce, Hive, Pig, Spark) | Process → Store (Spark Streaming, Flink, Apache Storm)
Data size per unit | Small - Huge (KB~TB) | Small (B~KB)
Query latency | Low - High (minutes to hours) | Low (milliseconds to minutes)
Request rate | Low - High (hourly/daily/monthly) | High - Very high (in seconds, minutes)
Durability | High - Very high | Low - High
Cost/GB | ¢~$ (Amazon S3, Glacier) | $~$$ (Redis, Memcached)
From Batch to Real-time: Lambda Architecture
[Diagram] Streaming data flows from the Data Source through Stream Ingestion into Stream Storage, then splits into two layers:
• Speed Layer: Stream Process → Real-time View
• Batch Layer: Stream Delivery → Raw Data Storage → Batch Process → Batch View
The Service Layer queries and merges results from both views for the Consumer.
Lambda Architecture
[Diagram] Streaming Data feeds both layers:
• Batch Layer: Raw Data → Batch Process → Batch View
• Speed Layer: Stream Process → Real-time View
The Serving Layer answers queries against both the Batch View and the Real-time View.
Key Components of Real-time Analytics
• Data Source: devices and/or applications that produce real-time data at high velocity
• Stream Ingestion: data from tens of thousands of data sources can be written to a single stream
• Stream Storage: data is stored in the order it was received for a set duration of time and can be replayed indefinitely during that time
• Stream Process: records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Data Sink: data lake (most common), database (least common)
Where to Store Streaming Data?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Stream Storage
Options: Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka (MSK).
[Diagram] Producers apply a hash function to each record’s partition key (PK) to choose a shard/partition. Within a shard, records are appended in order (oldest data → newest data), and each consumer in a consumer group tracks its own next offset:
shard/partition-1: 5 4 3 2 1 0
shard/partition-2: 3 2 1 0
shard/partition-3: 4 3 2 1 0
Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce
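Kinesis and Kafka both route records to a shard/partition by hashing the record’s partition key, which is what preserves per-key ordering while still allowing parallel consumption. A minimal sketch of the idea (illustrative only; Kinesis actually maps an MD5 hash of the key onto shard hash-key ranges):

```python
import hashlib

def choose_shard(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index via a stable hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# All records for the same key land on the same shard, so their relative
# order is preserved; different keys spread across shards for parallelism.
shards = {i: [] for i in range(3)}
for key, payload in [
    ("sensor-1", "t=0"), ("sensor-2", "t=0"),
    ("sensor-1", "t=1"), ("sensor-3", "t=0"),
    ("sensor-1", "t=2"),
]:
    shards[choose_shard(key, 3)].append((key, payload))
```

This is why a hot partition key ("skew") concentrates load on a single shard: the hash is stable by design.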
What about SQS?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (standard); FIFO queues preserve client ordering
• No streaming MapReduce
• No parallel consumption
• Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
[Diagram] Producers → Amazon SQS Queue (Standard or FIFO) → Consumers; Publisher → Amazon SNS Topic → subscribers (AWS Lambda function, Amazon SQS queue)
Amazon Kinesis
Data Streams
Amazon Managed
Streaming for Kafka
Amazon Managed Streaming for Kafka (MSK) vs Amazon Kinesis Data Streams

Amazon Managed Streaming for Kafka:
• Operational considerations: number of clusters? brokers per cluster? topics per broker? partitions per topic?
• Can only increase the number of partitions; can’t decrease
• Integrates with only a few AWS services, such as Kinesis Data Analytics for Java

Amazon Kinesis Data Streams:
• Operational considerations: number of streams? shards per stream?
• Can increase or decrease the number of shards
• Full integration with AWS services such as Lambda, Kinesis Data Analytics, etc.
Metrics to Monitor: MSK (Kafka)
[Diagram] Broker internals to watch: RequestQueue (length, wait time), ResponseQueue (length, wait time), network (packet drops?), produce/consume rate imbalance, leader placement (who is the leader?), disk full?, too many topics?
Metrics to Monitor: MSK (Kafka)

Metric | Level | Description
ActiveControllerCount | DEFAULT | Only one controller per cluster should be active at any given time.
OfflinePartitionsCount | DEFAULT | Total number of partitions that are offline in the cluster.
GlobalPartitionCount | DEFAULT | Total number of partitions across all brokers in the cluster.
GlobalTopicCount | DEFAULT | Total number of topics across all brokers in the cluster.
KafkaAppLogsDiskUsed | DEFAULT | The percentage of disk space used for application logs.
KafkaDataLogsDiskUsed | DEFAULT | The percentage of disk space used for data logs.
RootDiskUsed | DEFAULT | The percentage of the root disk used by the broker.
PartitionCount | PER_BROKER | The number of partitions for the broker.
LeaderCount | PER_BROKER | The number of leader replicas.
UnderMinIsrPartitionCount | PER_BROKER | The number of under-minIsr partitions for the broker.
UnderReplicatedPartitions | PER_BROKER | The number of under-replicated partitions for the broker.
FetchConsumerTotalTimeMsMean | PER_BROKER | The mean total time in milliseconds that consumers spend fetching data from the broker.
ProduceTotalTimeMsMean | PER_BROKER | The mean produce time in milliseconds.
How about monitoring Kinesis Data Streams?
[Diagram] A consumer application calls GetRecords() against a shard. The key question: how long does a record stay in a shard before it is read?
Metrics to Monitor: Kinesis Data Streams

Metric | Description
GetRecords.IteratorAgeMilliseconds | Age of the last record in all GetRecords calls
ReadProvisionedThroughputExceeded | Number of GetRecords calls throttled
WriteProvisionedThroughputExceeded | Number of PutRecord(s) calls throttled
PutRecord.Success, PutRecords.Success | Number of successful PutRecord(s) operations
GetRecords.Success | Number of successful GetRecords operations
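GetRecords.IteratorAgeMilliseconds is the one to alarm on: if the iterator age approaches the stream’s retention period, the consumer is falling so far behind that unread records will expire. A small helper expressing that check (the 24-hour default retention matches Kinesis; the 50% warning threshold is an assumption you should tune):

```python
DEFAULT_RETENTION_MS = 24 * 60 * 60 * 1000  # Kinesis default retention: 24 hours

def consumer_lag_status(iterator_age_ms: int,
                        retention_ms: int = DEFAULT_RETENTION_MS,
                        warn_ratio: float = 0.5) -> str:
    """Classify consumer lag relative to the stream's retention period."""
    ratio = iterator_age_ms / retention_ms
    if ratio >= 1.0:
        return "DATA_LOSS"       # records are expiring before being read
    if ratio >= warn_ratio:
        return "FALLING_BEHIND"  # alarm well before loss actually happens
    return "OK"
```

In practice you would feed this from a CloudWatch alarm on the metric rather than polling it yourself.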
Choosing Good Metrics
Too much information can be just as useless as too little
How to Ingest Streaming Data?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Stream Ingestion
• AWS SDKs: publish directly from application code via APIs
• AWS Mobile SDK
• Kinesis Agent: monitors log files and forwards lines as messages to Kinesis Data Streams
• Kinesis Producer Library (KPL): background process aggregates and batches messages
• 3rd-party and open source: Kafka Connect (kinesis-kafka-connector), fluentd (aws-fluent-plugin-kinesis), Log4J Appender (kinesis-log4j-appender), and more …
[Diagram: the ingestion tools above write into Amazon Kinesis Data Streams as the stream storage]
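The KPL’s core trick is that a background process aggregates and batches messages before making a single PutRecords call. A toy sketch of the batching half in pure Python (the `send` callback stands in for a real PutRecords wrapper; the real KPL also flushes on a timer and aggregates records within a batch):

```python
class RecordBatcher:
    """Buffer records and flush them in batches, KPL-style."""

    def __init__(self, send, max_batch_size=500):
        self.send = send              # callback, e.g. a PutRecords wrapper
        self.max_batch_size = max_batch_size
        self.buffer = []

    def put(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []

# Usage: 7 records become 3 API calls instead of 7.
batches = []
batcher = RecordBatcher(batches.append, max_batch_size=3)
for i in range(7):
    batcher.put({"seq": i})
batcher.flush()  # flush the final partial batch
```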
How to Process Streaming Data?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Stream Delivery
[Diagram] Kinesis Data Firehose delivers streaming data to destinations such as S3, Redshift, Elasticsearch, and Kinesis Data Analytics.
Sources:
• Kinesis Agent
• CloudWatch Logs
• CloudWatch Events
• AWS IoT
• Direct PUT using APIs
• Kinesis Data Streams
• MSK (Kafka) using Kafka Connect
Kinesis Data Firehose: Filter, Enrich, Convert
[Diagram] Apache log lines from the data source:
[Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
[Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]
are passed by Kinesis Data Firehose to a Lambda function, which filters, enriches (e.g., a geo-ip lookup), and converts them to JSON before delivery to the data sink:
{
  "recordId": "1",
  "result": "Ok",
  "data": {
    "date": "2017/10/11 14:32:52",
    "status": "error",
    "source": "192.34.86.178",
    "city": "Boston",
    "state": "MA"
  }
},
{
  "recordId": "2",
  "result": "Dropped"
}
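The Lambda function on this slide follows Firehose’s data-transformation contract: it receives base64-encoded records and must return each recordId with a result of Ok, Dropped, or ProcessingFailed. A minimal sketch (the log-parsing regex and the choice to drop non-error lines are illustrative; the geo-ip enrichment step is omitted):

```python
import base64
import json
import re

# Parses lines like: [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
LOG_RE = re.compile(r"^\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<source>[\d.]+)\]")

def handler(event, context):
    out = []
    for rec in event["records"]:
        line = base64.b64decode(rec["data"]).decode("utf-8")
        m = LOG_RE.match(line)
        if m and m.group("status") == "error":
            doc = m.groupdict()  # enrichment (e.g. geo-ip city/state) would go here
            out.append({
                "recordId": rec["recordId"],
                "result": "Ok",
                "data": base64.b64encode(json.dumps(doc).encode()).decode(),
            })
        else:
            # Filter: drop lines that are not error records
            out.append({"recordId": rec["recordId"], "result": "Dropped"})
    return {"records": out}
```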
Pre-built Data Transformation Blueprints

Blueprint | Description
General Processing | For custom transformation logic
Apache Log to JSON | Parses and converts Apache log lines to JSON objects using predefined JSON field names
Apache Log to CSV | Parses and converts Apache log lines to CSV format
Syslog to JSON | Parses and converts Syslog lines to JSON objects using predefined JSON field names
Syslog to CSV | Parses and converts Syslog lines to CSV format
Pre-built Data Conversion
[Diagram] Data Source → Kinesis Data Firehose (JSON data; schema from the AWS Glue Data Catalog) → convert to columnar format → Amazon S3 (failed records go to a /failed prefix)
• Converts the format of your input data from JSON to a columnar data format, Apache Parquet or Apache ORC, before storing the data in Amazon S3
• Works in conjunction with the transform feature to convert other formats to JSON before the data conversion
Failure and Error Handling
• S3 destination
  • Pause and retry for up to 24 hours (maximum data retention period)
  • If data delivery fails for more than 24 hours, your data is lost
• Redshift destination
  • Configurable retry duration (0-2 hours)
  • After retries, skip and load error manifest files to S3’s errors/ folder
• Elasticsearch destination
  • Configurable retry duration (0-2 hours)
  • After retries, skip and load failed records to S3’s elasticsearch_failed/ folder
Stream Process
• Transform: filter, enrich, convert
• Aggregation: windowed queries, Top-K contributors
• Join: stream-stream join, stream-(external) table join
Services: AWS Lambda, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Dive into Stream Process Services

AWS Lambda
• Serverless functions
• Event-based, stateless processing
• Continuous and simple scaling mechanism
[Diagram] Each event (1), (2), (3) invokes its own Lambda instance (1), (2), (3)

Amazon Kinesis Data Analytics (serverless), AWS Glue (serverless), Amazon EMR (fully managed)
Architecture: Master-Worker
[Diagram] A Master coordinates Workers (1), (2), (3); input partitions part-01, part-02, part-03 are assigned across the workers, which produce output partitions part-01, part-02, part-03

Streaming Programming Guide
Treat Streams as Unbounded Tables
“It's raining cats and dogs!”
→ ["It's", "raining", "cats", "and", "dogs!"]
→ [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
→ running word counts:
It's 1
raining 1
cats 1
and 1
dogs! 1
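Treating the stream as an unbounded table means each new micro-batch updates a continuously growing result table instead of recomputing from scratch. A pure-Python sketch of that incremental update (the same idea Spark Structured Streaming implements under the hood):

```python
from collections import Counter

word_counts = Counter()  # the "unbounded table" of results

def process_batch(lines):
    """Fold one micro-batch of input lines into the running counts."""
    for line in lines:
        word_counts.update(line.split())
    return dict(word_counts)

# Two micro-batches arrive; the result table grows across them.
process_batch(["It's raining cats and dogs!"])
process_batch(["raining again"])
```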
[Code walkthrough: set up the session, read the stream, apply the streaming ETL, and start it running]
What about (Stream) SQL?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream SQL Process → Data Sink]
“It's raining cats and dogs!” → [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] → word counts (It's 1, raining 1, cats 1, and 1, dogs! 1)
Kinesis Data Analytics (SQL)
• STREAM (in-application): a continuously updated entity that you can SELECT from and INSERT into, like a TABLE
• PUMP: an entity used to continuously 'SELECT ... FROM' a source STREAM and INSERT SQL results into an output STREAM
• Create an output stream, which can be used to send results to a destination
[Diagram] Source → SOURCE STREAM → INSERT & SELECT (PUMP) → DESTINATION STREAM → Destination
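A PUMP is just a continuous query that reads from one in-application stream and inserts into another. Conceptually (this is plain Python standing in for KDA’s SQL, not its actual syntax), with a simple predicate filter as the query:

```python
def pump(source, predicate, transform):
    """Continuously SELECT matching rows FROM source and INSERT into the output."""
    for row in source:            # in KDA the source stream is unbounded
        if predicate(row):
            yield transform(row)

# SOURCE_SQL_STREAM_001 -> DESTINATION_SQL_STREAM, keeping only error rows
source_stream = iter([
    {"status": "error", "source": "192.34.86.178"},
    {"status": "info", "source": "127.0.0.1"},
])
destination_stream = list(pump(
    source_stream,
    predicate=lambda r: r["status"] == "error",
    transform=lambda r: r["source"],
))
```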
SQL vs Java

DEMO
https://aws.amazon.com/ko/blogs/aws/new-amazon-kinesis-data-analytics-for-java/
[Diagram] (Java path) Amazon Kinesis Data Streams → Amazon Kinesis Data Analytics (Java) → Amazon Kinesis Data Firehose → Amazon S3
(SQL path) Amazon Kinesis Data Streams → Amazon Kinesis Data Analytics (SQL)
DEMO: Word Count
“It's raining cats and dogs!” → (1) [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] → (2) word counts: It's 1, raining 1, cats 1, and 1, dogs! 1
Filter, Enrich, Convert Streaming Data
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Revisit Example: Filter, Enrich, Convert
[Diagram] The same pipeline as before: Apache log lines → Kinesis Data Firehose → Lambda function (filter, enrich via geo-ip, convert to JSON, emitting recordId/result pairs) → data sink
Stream Process: Filter, Enrich, Convert
[Diagram] Apache log lines → Amazon Kinesis Data Streams → Lambda function, Amazon Kinesis Data Analytics, Amazon EMR, or AWS Glue (filter, enrich via geo-ip, convert to JSON) → data sink
Stream Process: Filter, Enrich, Convert
[Diagram] Apache log lines → Amazon MSK → Amazon Kinesis Data Analytics (Java), Amazon EMR, or AWS Glue (filter, enrich via geo-ip, convert to JSON) → data sink
Kinesis Data Analytics (SQL): Preprocessing Data
https://aws.amazon.com/ko/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
Integration of Stream Process and Stream Storage

Stream Storage | AWS Lambda | Kinesis Data Analytics (SQL) | Kinesis Data Analytics (Java) | Glue | EMR
Kinesis Data Firehose | O | O | X | X | X
Kinesis Data Streams | O | O | O | O | O
Managed Streaming for Kafka (MSK) | X | X | O | O | O

[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Aggregate Streaming Data
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Stream Process: Aggregation
• Aggregations (count, sum, min, ...) take granular real-time data and turn it into insights
• Data is continuously processed, so you need to tell the application when you want results
• Windowed queries:
  a. Sliding windows (with overlap)
  b. Tumbling windows (no overlap)
  c. Custom windows
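The difference between the window types is easiest to see in code. A sketch over timestamped events (event times in seconds; counts per window; window boundaries are the usual floor-to-boundary convention):

```python
from collections import defaultdict

def tumbling_counts(events, size):
    """Non-overlapping windows: each event belongs to exactly one window."""
    counts = defaultdict(int)
    for t in events:
        counts[(t // size) * size] += 1
    return dict(counts)

def sliding_counts(events, size, slide):
    """Overlapping windows: an event can fall into several windows."""
    counts = defaultdict(int)
    for t in events:
        # every window [start, start+size) whose start is on a slide boundary
        first = ((t - size) // slide + 1) * slide
        for start in range(max(0, first), t + 1, slide):
            if start <= t < start + size:
                counts[start] += 1
    return dict(counts)

events = [1, 2, 12, 13, 14, 25]
tumbling = tumbling_counts(events, size=10)          # windows start at 0, 10, 20
sliding = sliding_counts(events, size=10, slide=5)   # windows overlap by 5s
```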
Join Streaming Data
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Imagine it! How to build?
Stream Process: Join
[Diagram] Three approaches:
(a) Stream-Stream Join: two data sources, each with its own stream storage, joined inside a single stream process
(b) Stream Join by Partition Key: both data sources write to one stream storage, partitioned by a shared key, and the stream process joins within each shard
(c) Stream Join by Hash Table: one data source streams through stream storage into the stream process, which looks the other side up in key-value storage
Why is Stream-Stream Join so difficult?
[Diagram] Two sources write to two streams that a single stream process must join across time t0, t1, t2, ... tN, buffering each side for some interval ∆t
• Timing: matching records may arrive on the two streams at different times
• Skewed data: one stream may run far ahead of the other
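One way to handle the timing problem is to buffer each side for ∆t and match keys as records arrive. A toy sketch (in-memory buffers; records are (time, key, value) tuples; the fixed ∆t and simple eviction policy are assumptions a real engine makes configurable):

```python
def stream_stream_join(left, right, delta_t):
    """Join two timestamped streams on key, matching records within delta_t."""
    left_buf, right_buf, joined = [], [], []
    # Merge both streams in event-time order, tagging each record with its side.
    merged = sorted([(r, "L") for r in left] + [(r, "R") for r in right],
                    key=lambda x: x[0][0])
    for (t, key, value), side in merged:
        own, other = (left_buf, right_buf) if side == "L" else (right_buf, left_buf)
        # Evict records older than delta_t - they can no longer match.
        other[:] = [r for r in other if t - r[0] <= delta_t]
        for _, okey, ovalue in other:
            if okey == key:
                joined.append((key, value, ovalue) if side == "L"
                              else (key, ovalue, value))
        own.append((t, key, value))
    return joined

clicks = [(1, "u1", "click-a"), (9, "u2", "click-b")]
buys = [(3, "u1", "buy-a"), (30, "u2", "buy-b")]
matches = stream_stream_join(clicks, buys, delta_t=5)
```

Note how u2's purchase at t=30 never matches the click at t=9: with skewed or late data, a bounded ∆t trades completeness for bounded state.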
How about Stream Join by Partition Key?
[Diagram] Both data sources write records with the same partition key into shared stream storage, so matching records land in the same shard (e.g., shard-1: t1 t2 t3 t5; shard-2: t1 t2 t3 t5; shard-3: t1 t1 t2 t3) and the stream process can join within each shard
• Caveat: each shard will be filled with records coming from fast data producers
Lastly, how about Stream Join by Hash Table?
[Diagram] One data source streams through stream storage into the stream process; the other side is materialized into key-value storage (a hash table) that the stream process looks up per record
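The hash-table approach materializes the slowly changing side (e.g., reference data) into key-value storage and enriches each streaming record with a lookup. A sketch using a plain dict where a real pipeline might use DynamoDB, Redis, or Kinesis Data Analytics’ S3 reference data:

```python
# Reference data, e.g. loaded from a CSV in S3 into key-value storage
companies = {"AMZN": "Amazon", "CRM": "SomeCompanyC"}

def enrich(records, table):
    """Stream-table join: look up each record's key in the hash table."""
    for rec in records:
        yield {**rec, "company": table.get(rec["ticker"], "UNKNOWN")}

stream = [{"ticker": "AMZN", "price": 53.63}, {"ticker": "XYZ", "price": 1.0}]
enriched = list(enrich(stream, companies))
```

Because the lookup side is a point query, this avoids the buffering and timing issues of a true stream-stream join, at the cost of the table being only as fresh as its last refresh.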
DEMO: Filter, Aggregate, Join
Sample records on the stream:
{"TICKER_SYMBOL": "CVB", "SECTOR": "TECHNOLOGY", "CHANGE": 0.81, "PRICE": 53.63}
{"TICKER_SYMBOL": "ABC", "SECTOR": "RETAIL", "CHANGE": -1.14, "PRICE": 23.64}
{"TICKER_SYMBOL": "JKL", "SECTOR": "TECHNOLOGY", "CHANGE": 0.22, "PRICE": 15.32}
[Diagram] Amazon Kinesis Data Streams → Amazon Kinesis Data Analytics (SQL): continuous filter, aggregate function, data enrichment (join) against reference data in an S3 bucket:
Ticker,Company
AMZN,Amazon
ASD,SomeCompanyA
BAC,SomeCompanyB
CRM,SomeCompanyC
https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-add-reference-data.html
Comparing Stream Process Services

DevOps! Master-Worker Framework
[Diagram] Running a master-worker framework yourself raises operational questions: Is the master alive? Do workers have enough resources (CPU, memory, disk)? Checkpointing? The right instance type (C-family or R-family)? What is the learning curve (SQL, Python, Scala, Java)?
EMR vs Glue vs Kinesis Data Analytics
[Chart: operational excellence vs degree of freedom (≈ complexity). Kinesis Data Analytics (SQL) offers the most operational excellence with the least freedom; Kinesis Data Analytics (Java) and AWS Glue sit in between; EMR offers the most freedom at the cost of the most operational burden.]
Comparing stream processing services

AWS Lambda - simple programming interface and scaling:
• Serverless functions
• Six languages (Java, Python, Golang, Node.js, Ruby, C#)
• Event-based, stateless processing
• Continuous and simple scaling mechanism

Amazon Kinesis Data Analytics - easy and powerful stream processing:
• Serverless applications
• Supports SQL and Java (Apache Flink)
• Stateful processing with automatic backups
• Stream operators make building apps easy

AWS Glue - simple, flexible, and cost-effective ETL & Data Catalog:
• Serverless applications
• Can use the transforms native to Apache Spark Structured Streaming
• Automatically discovers new data and extracts schema definitions
• Automatically generates the ETL code

Amazon EMR - flexibility and choice for your needs:
• Choose your instances
• Use your favorite open-source framework
• Fine-grained control over the cluster, debugging tools, and more
• Deep open-source tool integrations with AWS
Case Studies

Example Usage Pattern 1: Web Analytics and Leaderboards
[Diagram] Lightweight JS client code (via Amazon Cognito) or a web server on Amazon EC2 ingests web app data into Amazon Kinesis Data Streams; Amazon Kinesis Data Analytics computes the top 10 users; a Lambda function persists results to Amazon DynamoDB to feed live apps
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
Example Usage Pattern 2: Monitoring IoT Devices
[Diagram] Ingest sensor data, convert JSON to Parquet, and store all data points in an S3 data lake
https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/
Example Usage Pattern 3: Analyzing AWS CloudTrail Event Logs
[Diagram] AWS CloudTrail → CloudWatch Events trigger → Kinesis Data Firehose ingests raw log data → Kinesis Data Analytics computes operational metrics → Lambda function delivers to a DynamoDB table feeding a Chart.JS dashboard, and to an S3 bucket for raw-data archival
https://aws.amazon.com/solutions/implementations/real-time-insights-account-activity/
Takeaways

From Batch to Real-time: Lambda Architecture
[Diagram] Data Source → Stream Ingestion → Stream Storage, then Speed Layer (Stream Process → Real-time View) and Batch Layer (Stream Delivery → Raw Data Storage → Batch Process → Batch View), merged by the Service Layer (Query & Merge Results) for the Consumer
Key Components of Real-time Analytics
[Diagram] Data Source → Stream Ingestion (AWS SDKs, Kinesis Agent, KPL, Kafka Connect) → Stream Storage (Kinesis Data Streams, Kinesis Data Firehose, Managed Streaming for Kafka) → Stream Process (AWS Lambda, Kinesis Data Analytics, Glue, EMR) → Data Sink
Real-Time Applications: aggregation, Top-K contributors, anomaly detection
Streaming ETL: filter, enrich, convert, join
Key Takeaways
• Build decoupled systems
  • Data → Store → Process → Store → Analyze → Answers
  • Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
• Follow the principle of “extract data once and reuse multiple times” to power new customer experiences
• Use the right tool for the job
  • Know the AWS services’ soft and hard limits
• Leverage managed and serverless services (DevOps!)
  • Scalable/elastic, available, reliable, secure, no/low admin
Where To Go Next?
• AWS Analytics Immersion Day - Build BI System from Scratch
• Workshop - https://tinyurl.com/yapgwv77
• Slides - https://tinyurl.com/ybxkb74b
• Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1, 2
• Part1 - https://tinyurl.com/y8vo8q7o
• Part2 - https://tinyurl.com/ycbv7wel
• Streaming Analytics Workshop – Kinesis Data Analytics for Java (Flink)
https://streaming-analytics.labgui.de/
• Amazon MSK Labs
https://amazonmsk-labs.workshop.aws/
• Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming
https://tinyurl.com/y7hklyff
• AWS Glue Streaming ETL - Scala Script Example
https://tinyurl.com/y79x6jda
Appendix
• Amazon Managed Streaming for Apache Kafka: Best Practices
https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html
• Optimizing Your Apache Kafka® Deployment
https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
• Monitoring Kafka performance metrics
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Realtime Analytics on AWS

  • 1.
    Real-time Analytics on Sungmin,Kim Solutions Architect, AWS
  • 2.
    Agenda • Why Real-timeData streaming and Analytics? • How to Build? • Where to Store streaming data? • How to Ingest streaming data? • How to Process streaming data? • Delivery Streaming Data • Dive into Stream Process Framework • Transform, Aggregate, Join Streaming Data • Case Studies • Key Takeaways
  • 3.
    Why Real-time Datastreaming and Analytics?
  • 4.
    Data The world’s most valuableresource is no longer oil, but data.* *Copyright: David Parkins , The Economist, 2017 “ ”
  • 5.
    Data Loses ValueOver Time * Source: Mike Gualtieri, Forrester, Perishable insights Real time Seconds Minutes Hours Days Months Valueofdatatodecision-making Preventive/predictive Actionable Reactive Historical Time-critical decisions Traditional “batch” business intelligence
  • 6.
    To create Value,derive insights in Real-time
  • 7.
    Batch vs Real-time BatchDifference Real-time Arbitrarily, or Periodically Continuity Constant Store → Process (Hadoop MapReduce, Hive, Pig, Spark) Method of analysis Process → Store (Spark Streaming, Flink, Apache Storm) Small - Huge (KB~TB) Data size per a unit Small (B~KB) Low - High (minutes to hours) Query Latency Low (milliseconds to minutes) Low - High (hourly/daily/monthly) Request Rate Very High - High (in seconds, minutes) High - Very high Durability Low - High ¢~$ (Amazon S3, Glacier) Cost/GB $$~$ (Redis, Memcached)
  • 8.
    From Batch toReal-time: Lambda Architecture Data Source Stream Storage Speed Layer Batch Layer Batch Process Batch View Real- time View Consumer Query & Merge Results Service Layer Stream Ingestion Raw Data Storage Streaming Data Stream Delivery Stream Process
  • 9.
    Lambda Architecture Streaming Data Batch View StreamProcess Real-time View Query Query Batch View Real-time View Raw Data Batch Process Batch Layer Serving Layer Speed Layer
  • 10.
    Key Components ofReal-time Analytics Data Source Stream Storage Stream Process Stream Ingestion Data Sink Devices and/or applications that produce real-time data at high velocity Data from tens of thousands of data sources can be written to a single stream Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time Records are read in the order they are produced, enabling real-time analytics or streaming ETL Data lake (most common) Database (least common)
  • 11.
    Where to StoreStreaming Data? Data Source Stream Storage Stream Process Stream Ingestion Data Sink
  • 12.
  • 13.
    Hash Function Consumer Consumer Consumer Consumer Group PK PK PK PK = nextconsumer offset oldest datanewest data Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka Producers shard/partition-1 shard/partition-2 5 4 3 2 1 0 3 2 1 0 4 3 2 1 0 4 2 0 shard/partition-3
  • 14.
    Why is StreamStorage? • Decouple producers & consumers • Persistent buffer • Collect multiple streams • Preserve client ordering • Parallel consumption • Streaming MapReduce
  • 15.
    • Decouple producers& consumers • Persistent buffer • Collect multiple streams • No client ordering (standard) • FIFO queue preserves client ordering • No streaming MapReduce • No parallel consumption • Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions) Consumers 4 3 2 1 12344 3 2 1 1234 2134 13342 Standard FIFO Producers Amazon SQS Queue What about SQS? Publisher Amazon SNS Topic AWS Lambda function Amazon SQS queue Queue Subscriber
  • 16.
    Topic Amazon Kinesis Data Streams AmazonManaged Streaming for Kafka
  • 17.
    Amazon Kinesis Data Streams AmazonManaged Streaming for Kafka • Operational Considerations • Number of clusters? • Number of brokers per cluster? • Number of topics per broker? • Number of partitions per topic? • Only increase number of partitions; can’t decrease • Integration with a few of AWS Services such as Kinesis Data Analytics for Java • Operational Considerations • Number of Kinesis Data Streams? • Number of shards per stream? • Increase/Decrease number of shards • Fully Integration with AWS Services such as Lambda function, Kinesis Data Analytics, etc
  • 18.
    RequestQueue - Length - WaitTime ResponseQueue -Length - WaitTime Network - Packet Drop? Produce/Consume Rate Unbalance Who is Leader? Disk Full? Too many topics? Metrics to Monitor: MSK (Kafka)
  • 19.
    Metrics to Monitor:MSK (Kafka) Metric Level Description ActiveControllerCount DEFAULT Only one controller per cluster should be active at any given time. OfflinePartitionsCount DEFAULT Total number of partitions that are offline in the cluster. GlobalPartitionCount DEFAULT Total number of partitions across all brokers in the cluster. GlobalTopicCount DEFAULT Total number of topics across all brokers in the cluster. KafkaAppLogsDiskUsed DEFAULT The percentage of disk space used for application logs. KafkaDataLogsDiskUsed DEFAULT The percentage of disk space used for data logs. RootDiskUsed DEFAULT The percentage of the root disk used by the broker. PartitionCount PER_BROKER The number of partitions for the broker. LeaderCount PER_BROKER The number of leader replicas. UnderMinIsrPartitionCount PER_BROKER The number of under minIsr partitions for the broker. UnderReplicatedPartitions PER_BROKER The number of under-replicated partitions for the broker. FetchConsumerTotalTimeMsMean PER_BROKER The mean total time in milliseconds that consumers spend on fetching data from the broker. ProduceTotalTimeMsMean PER_BROKER The mean produce time in milliseconds.
  • 20.
    How about monitoringKinesis Data Streams? Consumer Application GetRecords() Data How long time does a record stay in a shard?
  • 21.
    Metrics to Monitor:Kinesis Data Streams Metric Description GetRecords.IteratorAgeMilliseconds Age of the last record in all GetRecords ReadProvisionedThroughputExceeded Number of GetRecords calls throttled WriteProvisionedThroughputExceeded Number of PutRecord(s) calls throttled PutRecord.Success, PutRecords.Success Number of successful PutRecord(s) operations GetRecords.Success Number of successful GetRecords operations
  • 22.
    Choosing Good Metrics Toomuch information can be just as useless as too little
  • 23.
    How to IngestStreaming Data? Data Source Stream Storage Stream Process Stream Ingestion Data Sink
  • 24.
    Stream Ingestion • AWSSDKs • Publish directly from application code via APIs • AWS Mobile SDK • Kinesis Agent • Monitors log files and forwards lines as messages to Kinesis Data Streams • Kinesis Producer Library (KPL) • Background process aggregates and batches messages • 3rd party and open source • Kafka Connect (kinesis-kafka-connector) • fluentd (aws-fluent-plugin-kinesis) • Log4J Appender (kinesis-log4j-appender) • and more … Data Source Stream Storage Stream Process Stream Ingestion Data Sink Amazon Kinesis Data Streams
  • 25.
    How to ProcessStreaming Data? Data Source Stream Storage Stream Process Stream Ingestion Data Sink
  • 26.
    Elasticsearch Redshift Stream Delivery Data Source Stream Storage Stream Process Stream Ingestion Data Sink Stream Delivery Kinesis Data Firehose •Kinesis Agent • CloudWatch Logs • CloudWatch Events • AWS IoT • Direct PUT using APIs • Kinesis Data Streams • MSK(Kafka) using Kafka Connect Kinesis Data Analytics S3
  • 27.
    Kinesis Firehose: Filter,Enrich, Convert Data Source apache log apache log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } json Lambda function Kinesis Data Firehose
  • 28.
    Pre-built Data TransformationBlueprints Blueprint Description General Processing For custom transformation logic Apache Log to JSON Parses and converts Apache log lines to JSON objects using predefined JSON field names Apache Log to CSV Parses and converts Apache log lines to CSV format Syslog to JSON Parses and converts Syslog lines to JSON objects using predefined JSON field names Syslog to CSV Parses and converts Syslog lines to CSV format
  • 29.
    Pre-built Data Conversion Data Source Kinesis DataFirehose JSON Data schema AWS Glue Data Catalog Amazon S3 • Convert the format of your input data from JSON to columnar data format Apache Parquet or Apache ORC before storing the data in Amazon S3 • Works in conjunction to the transform features to convert other format to JSON before the data conversion convert to columnar format /failed
  • 30.
    Failure and ErrorHandling • S3 Destination • Pause and retry for up to 24 hours (maximum data retention period) • If data delivery fails for more than 24 hours, your data is lost. • Redshift Destination • Configurable retry duration (0-2 hours) • After retry, skip and load error manifest files to S3’s errors/ folder • Elasticsearch Destination • Configurable retry duration (0-2 hours) • After retry, skip and load failed records to S3’s elasticsearch_failed/ folder
  • 31.
    Stream Process • Transform •Filter, Enrich, Convert • Aggregation • Windows Queries • Top-K Contributor • Join • Stream-Stream Join • Stream-(External) Table Join Data Source Stream Storage Stream Process Stream Ingestion Data Sink AWS Lambda Amazon Kinesis Data Analytics AWS Glue Amazon EMR
  • 32.
    Dive into StreamProcess Services
  • 33.
    AWS Lambda • Serverlessfunctions • Event-based, stateless processing • Continuous and simple scaling mechanism event (3) event (2) event (1) Lambda (1) Lambda (2) Lambda (3)
  • 34.
    Amazon Kinesis Data Analytics AWSGlue Amazon EMR Serverless ServerlessFully Managed
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
    Treat Streams asUnbounded Tables
  • 40.
    “It's raining catsand dogs!” ["It's", "raining", "cats", "and", "dogs!"] [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1
  • 42.
    “It's raining catsand dogs!” ["It's", "raining", "cats", "and", "dogs!"] [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1
  • 44.
  • 45.
    What about (Stream)SQL? Data Source Stream Storage Stream SQL Process Stream Ingestion Data Sink [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] “It's raining cats and dogs!” It’s 1 raining 1 cats 1 and 1 dogs! 1
  • 46.
    Kinesis Data Analytics(SQL) • STREAM (in-application): a continuously updated entity that you can SELECT from and INSERT into like a TABLE • PUMP: an entity used to continuously 'SELECT ... FROM' a source STREAM, and INSERT SQL results into an output STREAM • Create output stream, which can be used to send to a destination SOURCE STREAM INSERT & SELECT (PUMP) DESTIN. STREAM Destination Source [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
  • 47.
  • 48.
  • 49.
    https://aws.amazon.com/ko/blogs/aws/new-amazon-kinesis-data-analytics-for-java/ Amazon Kinesis Data Streams AmazonKinesis Data Firehose Amazon S3Amazon Kinesis Data Analytics (Java) Amazon Kinesis Data Streams Amazon Kinesis Data Streams Amazon Kinesis Data Analytics (SQL) DEMO: Word Count “It's raining cats and dogs!” [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1 [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] 1 2
  • 50.
    Filter, Enrich, Convert Streaming Data: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
  • 51.
    Revisit Example: Filter, Enrich, Convert
    Data Source → Kinesis Data Firehose (+ Lambda function for geo-ip lookup) → Data Sink
    Input (apache log):
    [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
    [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]
    Enriched record (json): { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }
    Transformation output (json):
    { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } },
    { "recordId": "2", "result": "Dropped" }
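A minimal sketch of such a transformation Lambda in Python, following the Kinesis Data Firehose record-transformation contract (each record carries a `recordId` and base64 `data`, and each result must be marked `Ok`, `Dropped`, or `ProcessingFailed`). The `GEO_DB` dict is a hypothetical stand-in for a real geo-ip service, and the date is passed through without reformatting:

```python
import base64
import json
import re

# Hypothetical stand-in for a geo-ip lookup service
GEO_DB = {"192.34.86.178": {"city": "Boston", "state": "MA"}}

LOG_RE = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<ip>[\d.]+)\]"
)

def handler(event, context):
    """Firehose transformation Lambda: parse apache-style log lines,
    drop non-error records, enrich errors with geo-ip data."""
    output = []
    for rec in event["records"]:
        line = base64.b64decode(rec["data"]).decode()
        m = LOG_RE.match(line)
        if m is None or m.group("status") != "error":
            # Filtered out: Firehose discards records marked Dropped
            output.append({"recordId": rec["recordId"], "result": "Dropped"})
            continue
        doc = {"date": m.group("date"), "status": m.group("status"),
               "source": m.group("ip")}
        doc.update(GEO_DB.get(m.group("ip"), {}))   # enrich step
        output.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(doc).encode()).decode(),
        })
    return {"records": output}

# The slide's two sample log lines, wrapped as Firehose records
lines = [
    "[Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]",
    "[Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]",
]
event = {"records": [
    {"recordId": str(i + 1), "data": base64.b64encode(l.encode()).decode()}
    for i, l in enumerate(lines)
]}
response = handler(event, None)
```

As on the slide, the error line comes back as record 1 with `result: Ok` and an enriched json payload, while the info line is record 2 with `result: Dropped`.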
  • 52.
    Stream Process: Filter, Enrich, Convert
    Data Source → Amazon Kinesis Data Streams → (Lambda function | Amazon Kinesis Data Analytics | Amazon EMR | AWS Glue) → Data Sink
    Same example: apache log lines are filtered, geo-ip enriched, and converted to json records wrapped with "recordId" and "result" ("Ok" or "Dropped")
  • 53.
    Stream Process: Filter, Enrich, Convert
    Data Source → Amazon MSK → (Amazon Kinesis Data Analytics (Java) | Amazon EMR | AWS Glue) → Data Sink
    Same example: apache log lines are filtered, geo-ip enriched, and converted to json records
  • 54.
    Kinesis Data Analytics (SQL): Preprocessing Data
    https://aws.amazon.com/ko/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
  • 55.
    Integration of Stream Process and Stream Storage
    Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink

    Stream Storage \ Stream Process   | Lambda | KDA (SQL) | KDA (Java) | Glue | EMR
    Kinesis Data Firehose             |   O    |     O     |     X      |  X   |  X
    Kinesis Data Streams              |   O    |     O     |     O      |  O   |  O
    Managed Streaming for Kafka (MSK) |   X    |     X     |     O      |  O   |  O
    (Lambda = AWS Lambda; KDA = Kinesis Data Analytics)
  • 57.
    Stream Process: Aggregation
    • Aggregations (count, sum, min, ...) take granular real-time data and turn it into insights
    • Data is continuously processed, so you need to tell the application when you want results
    • Windowed Queries
      a. Sliding Windows (with overlap)
      b. Tumbling Windows (no overlap)
      c. Custom Windows
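A tumbling window can be sketched in a few lines of Python: this toy version assigns each event to exactly one fixed-size window by truncating its timestamp to the window boundary (the event times and ticker keys are made up for illustration):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Tumbling windows (no overlap): each event falls into exactly one
    window; emit one aggregate per (window_start, key)."""
    windows = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)   # align to window boundary
        windows[(window_start, key)] += 1
    return dict(windows)

events = [(0, "AMZN"), (5, "AMZN"), (12, "AMZN"), (13, "ABC")]
counts = tumbling_window_counts(events, window_seconds=10)
# AMZN lands twice in window [0, 10) and once in [10, 20)
```

A sliding window would instead assign each event to every window that overlaps its timestamp, so the same event contributes to multiple aggregates.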
  • 59.
    Imagine It! How to Build?
  • 60.
    Stream Process: Join
    (a) Stream-Stream Join: two Data Sources → two Stream Storages → one Stream Process
    (b) Stream-Join by Partition Key: two Data Sources → one shared Stream Storage → Stream Process
    (c) Stream-Join by Hash Table: Data Source → Stream Storage → Stream Process, with lookups against a Key-Value Storage
  • 61.
    Why is Stream-Stream Join so difficult?
    Two Data Sources → two Stream Storages → Stream Process → Data Sink
    • Timing: matching records arrive at different times (t0, t1, t2, ..., tN), so each side must be buffered for some window ∆t before it can be joined
    • Skewed data: one stream may run far ahead of the other
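The timing problem can be made concrete with a toy sketch: each side is buffered, and only records whose event times fall within ∆t of each other can be joined, so a match on a slow or skewed stream can be missed (the names and timestamps below are illustrative):

```python
def stream_stream_join(left, right, delta_t):
    """Join two event streams on key, but only when their event times are
    within delta_t of each other; everything else must stay buffered (and
    eventually be evicted), which is what makes this pattern hard."""
    buffered = list(left)        # (ts, key, value) records awaiting a match
    joined = []
    for ts, key, value in right:
        for lts, lkey, lvalue in buffered:
            if lkey == key and abs(lts - ts) <= delta_t:
                joined.append((key, lvalue, value))
    return joined

clicks    = [(1, "u1", "click"), (2, "u2", "click")]
purchases = [(3, "u1", "purchase"), (50, "u2", "purchase")]
matches = stream_stream_join(clicks, purchases, delta_t=5)
# u2's purchase arrives 48 time units after the click, outside the
# delta_t window, so that pair is lost unless the buffer is kept longer
```

Growing ∆t catches more late matches but forces the processor to hold more state, which is exactly the trade-off real engines manage with windowing and watermarks.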
  • 62.
    How about Stream-Join by Partition Key?
    Two Data Sources write into one shared Stream Storage, partitioned by the join key, so matching records land in the same shard (shard-1, shard-2, shard-3, each holding records at t1, t2, t3, t5) → Stream Process
    • Caveat: each shard will be filled with records coming from fast data producers
  • 63.
    Lastly, how about Stream-Join by Hash Table?
    Data Source → Stream Storage → Stream Process, which joins each streaming record against a Key-Value Storage (hash table) lookup instead of buffering a second stream
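A toy sketch of this pattern in Python, with a dict standing in for the key-value store (Redis, Memcached, or DynamoDB in practice; the record fields are illustrative):

```python
# A dict stands in for the key-value store (Redis, Memcached, DynamoDB, ...)
reference = {"AMZN": "Amazon", "BAC": "SomeCompanyB"}

def enrich(stream_records, kv_store):
    """Stream-join by hash table: each streaming record does a point
    lookup on the slow-changing side instead of buffering a second
    stream, so there is no timing or skew problem to manage."""
    for rec in stream_records:
        yield {**rec, "company": kv_store.get(rec["ticker"])}

rows = list(enrich([{"ticker": "AMZN", "price": 53.63}], reference))
```

This works well when one side of the join is slow-changing reference data; the cost moves from buffering state in the processor to lookup latency against the store.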
  • 65.
    DEMO: Filter, Aggregate, Join
    Amazon Kinesis Data Streams → Amazon Kinesis Data Analytics (SQL): continuous filter, aggregate function, data enrichment (join) against reference data in a bucket with objects
    Sample records:
    {"TICKER_SYMBOL": "CVB", "SECTOR": "TECHNOLOGY", "CHANGE": 0.81, "PRICE": 53.63}
    {"TICKER_SYMBOL": "ABC", "SECTOR": "RETAIL", "CHANGE": -1.14, "PRICE": 23.64}
    {"TICKER_SYMBOL": "JKL", "SECTOR": "TECHNOLOGY", "CHANGE": 0.22, "PRICE": 15.32}
    Reference data (CSV):
    Ticker,Company
    AMZN,Amazon
    ASD,SomeCompanyA
    BAC,SomeCompanyB
    CRM,SomeCompanyC
    https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-add-reference-data.html
  • 67.
    DevOps! Master-Worker Framework
    Master coordinates Worker (1), Worker (2), Worker (3), each assigned partitions part-01, part-02, part-03
    Operational questions you own when self-managing:
    • Is the Master alive?
    • Does each Worker have enough resources (CPU, memory, disk)?
    • Checkpointing?
    • Right instance type: C-family or R-family?
    • Learning curve: SQL, Python, Scala, or Java?
  • 68.
    EMR vs Glue vs Kinesis Data Analytics
    A trade-off between Operational Excellence and Degree of Freedom (≈ Complexity), spanning Kinesis Data Analytics (SQL), Kinesis Data Analytics (Java), Glue, and EMR
  • 69.
    Comparing stream processing services

    AWS Lambda: simple programming interface and scaling
    • Serverless functions
    • Six languages (Java, Python, Golang, Node.js, Ruby, C#)
    • Event-based, stateless processing
    • Continuous and simple scaling mechanism

    Amazon Kinesis Data Analytics: easy and powerful stream processing
    • Serverless applications
    • Supports SQL and Java (Apache Flink)
    • Stateful processing with automatic backups
    • Stream operators make building apps easy

    AWS Glue: simple, flexible, and cost-effective ETL & Data Catalog
    • Serverless applications
    • Can use the transforms native to Apache Spark Structured Streaming
    • Automatically discovers new data and extracts schema definitions
    • Automatically generates the ETL code

    Amazon EMR: flexibility and choice for your needs
    • Choose your instances
    • Use your favorite open-source framework
    • Fine-grained control over cluster, debugging tools, and more
    • Deep open-source tool integrations with AWS
  • 71.
    Example Usage Pattern 1: Web Analytics and Leaderboards
    Lightweight JS client code or web server on Amazon EC2 → Amazon Kinesis Data Streams (ingest web app data) → Amazon Kinesis Data Analytics (compute top 10 users) → Lambda function → Amazon DynamoDB (persist to feed live apps), with Amazon Cognito providing client credentials
    https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
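The "compute top 10 users" step can be sketched as a per-window count plus a top-K selection. This is a toy, in-memory version; the event shapes and user names are illustrative:

```python
import heapq
from collections import Counter

def top_k_users(page_view_events, k=10):
    """Leaderboard step of the pattern: count events per user in the
    current window, then keep only the k heaviest contributors."""
    counts = Counter(user for user, _page in page_view_events)
    # nlargest sorts by count, descending, and keeps only k entries
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

events = [("alice", "/home"), ("bob", "/home"), ("alice", "/cart")]
leaders = top_k_users(events, k=2)   # [("alice", 2), ("bob", 1)]
```

In the AWS pattern this computation runs inside Kinesis Data Analytics over a time window, and the resulting leaderboard rows are what the Lambda function persists to DynamoDB.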
  • 72.
    Example Usage Pattern 2: Monitoring IoT Devices
    Ingest sensor data → convert json to parquet → store all data points in an S3 data lake
    https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/
  • 73.
    Example Usage Pattern 3: Analyzing AWS CloudTrail Event Logs
    AWS CloudTrail → CloudWatch Events trigger → Kinesis Data Analytics + Lambda function: ingest raw log data, compute operational metrics
    Kinesis Data Firehose delivers to an S3 bucket for raw data; a DynamoDB table feeds a Chart.JS dashboard (deliver to real-time dashboards and archival)
    https://aws.amazon.com/solutions/implementations/real-time-insights-account-activity/
  • 75.
    From Batch to Real-time: Lambda Architecture
    Data Source → Stream Ingestion → Stream Storage, then two paths:
    • Speed Layer: Stream Process → Real-time View
    • Batch Layer: Stream Delivery → Raw Data Storage → Batch Process → Batch View
    Service Layer: Consumer queries and merges results from both views
  • 76.
    Key Components of Real-time Analytics
    Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
    • Stream Ingestion: Kafka Connect, KPL, Kinesis Agent, AWS SDKs
    • Stream Storage: Kinesis Data Firehose, Kinesis Data Streams, Managed Streaming for Kafka
    • Stream Process: AWS Lambda, Kinesis Data Analytics, Glue, EMR
    • Real-Time Applications: Aggregation, Top-K Contributor, Anomaly Detection
    • Streaming ETL: Filter, Enrich, Convert, Join
  • 77.
    Key Takeaways
    • Build decoupled systems
      • Data → Store → Process → Store → Analyze → Answers
      • Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
    • Follow the principle of "extract data once and reuse multiple times" to power new customer experiences
    • Use the right tool for the job
      • Know the AWS services' soft and hard limits
    • Leverage managed and serverless services (DevOps!)
      • Scalable/elastic, available, reliable, secure, no/low admin
  • 78.
    Where To Go Next?
    • AWS Analytics Immersion Day - Build BI System from Scratch
      • Workshop - https://tinyurl.com/yapgwv77
      • Slides - https://tinyurl.com/ybxkb74b
    • Writing SQL on Streaming Data with Amazon Kinesis Analytics, Parts 1 and 2
      • Part 1 - https://tinyurl.com/y8vo8q7o
      • Part 2 - https://tinyurl.com/ycbv7wel
    • Streaming Analytics Workshop - Kinesis Data Analytics for Java (Flink): https://streaming-analytics.labgui.de/
    • Amazon MSK Labs: https://amazonmsk-labs.workshop.aws/
    • Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming: https://tinyurl.com/y7hklyff
    • AWS Glue Streaming ETL - Scala Script Example: https://tinyurl.com/y79x6jda
  • 79.
    Appendix
    • Amazon Managed Streaming for Apache Kafka: Best Practices
      https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html
    • Optimizing Your Apache Kafka® Deployment
      https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
    • Monitoring Kafka performance metrics
      https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/