Real-time Analytics on AWS
Sungmin Kim
Solutions Architect, AWS
Agenda
• Why Real-time Data Streaming and Analytics?
• How to Build?
• Where to Store Streaming Data?
• How to Ingest Streaming Data?
• How to Process Streaming Data?
• Deliver Streaming Data
• Dive into Stream Process Services
• Transform, Aggregate, Join Streaming Data
• Case Studies
• Key Takeaways
Why Real-time Data Streaming and Analytics?
Data
“The world’s most valuable resource is no longer oil, but data.”
* Copyright: David Parkins, The Economist, 2017
Data Loses Value Over Time
[Chart: the value of data to decision-making declines over time, from real time (preventive/predictive) through seconds/minutes (actionable) and hours/days (reactive) to months (historical). Time-critical decisions live on the left of the curve; traditional “batch” business intelligence lives on the right.]
* Source: Mike Gualtieri, Forrester, “Perishable Insights”
To create value, derive insights in real time
Batch vs Real-time

Difference | Batch | Real-time
Continuity | Arbitrary or periodic | Constant
Method of analysis | Store → Process (Hadoop MapReduce, Hive, Pig, Spark) | Process → Store (Spark Streaming, Flink, Apache Storm)
Data size per unit | Small - Huge (KB~TB) | Small (B~KB)
Query latency | Low - High (minutes to hours) | Low (milliseconds to minutes)
Request rate | Low - High (hourly/daily/monthly) | High - Very high (in seconds, minutes)
Durability | High - Very high | Low - High
Cost/GB | ¢~$ (Amazon S3, Glacier) | $~$$ (Redis, Memcached)
From Batch to Real-time: Lambda Architecture
[Diagram] Streaming data flows from the Data Source through Stream Ingestion into Stream Storage, then splits into two layers:
• Speed Layer: Stream Process → Real-time View
• Batch Layer: Stream Delivery → Raw Data Storage → Batch Process → Batch View
The Service Layer queries and merges results from both views for the Consumer.
Lambda Architecture
[Diagram] Streaming Data feeds both layers:
• Batch Layer: Raw Data → Batch Process → Batch View
• Speed Layer: Stream Process → Real-time View
The Serving Layer answers queries against both the Batch View and the Real-time View.
Key Components of Real-time Analytics
• Data Source: devices and/or applications that produce real-time data at high velocity
• Stream Ingestion: data from tens of thousands of data sources can be written to a single stream
• Stream Storage: data is stored in the order it was received for a set duration of time and can be replayed indefinitely during that time
• Stream Process: records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Data Sink: data lake (most common), database (least common)
Where to Store Streaming Data?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Stream Storage
Options: Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka (MSK).
[Diagram] Producers apply a hash function to each record’s partition key (PK) to choose a shard/partition. Within a shard, records are appended in order (oldest data → newest data), and each consumer in a consumer group tracks its own next offset:
shard/partition-1: 5 4 3 2 1 0
shard/partition-2: 3 2 1 0
shard/partition-3: 4 3 2 1 0
Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce
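Kinesis and Kafka both route records to a shard/partition by hashing the record’s partition key, which is what preserves per-key ordering while still allowing parallel consumption. A minimal sketch of the idea (illustrative only; Kinesis actually maps an MD5 hash of the key onto shard hash-key ranges):

```python
import hashlib

def choose_shard(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index via a stable hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# All records for the same key land on the same shard, so their relative
# order is preserved; different keys spread across shards for parallelism.
shards = {i: [] for i in range(3)}
for key, payload in [
    ("sensor-1", "t=0"), ("sensor-2", "t=0"),
    ("sensor-1", "t=1"), ("sensor-3", "t=0"),
    ("sensor-1", "t=2"),
]:
    shards[choose_shard(key, 3)].append((key, payload))
```

This is why a hot partition key ("skew") concentrates load on a single shard: the hash is stable by design.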
What about SQS?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (standard); FIFO queues preserve client ordering
• No streaming MapReduce
• No parallel consumption
• Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
[Diagram] Producers → Amazon SQS Queue (Standard or FIFO) → Consumers; Publisher → Amazon SNS Topic → subscribers (AWS Lambda function, Amazon SQS queue)
Amazon Kinesis
Data Streams
Amazon Managed
Streaming for Kafka
Amazon Managed Streaming for Kafka (MSK) vs Amazon Kinesis Data Streams

Amazon Managed Streaming for Kafka:
• Operational considerations: number of clusters? brokers per cluster? topics per broker? partitions per topic?
• Can only increase the number of partitions; can’t decrease
• Integrates with only a few AWS services, such as Kinesis Data Analytics for Java

Amazon Kinesis Data Streams:
• Operational considerations: number of streams? shards per stream?
• Can increase or decrease the number of shards
• Full integration with AWS services such as Lambda, Kinesis Data Analytics, etc.
Metrics to Monitor: MSK (Kafka)
[Diagram] Broker internals to watch: RequestQueue (length, wait time), ResponseQueue (length, wait time), network (packet drops?), produce/consume rate imbalance, leader placement (who is the leader?), disk full?, too many topics?
Metrics to Monitor: MSK (Kafka)

Metric | Level | Description
ActiveControllerCount | DEFAULT | Only one controller per cluster should be active at any given time.
OfflinePartitionsCount | DEFAULT | Total number of partitions that are offline in the cluster.
GlobalPartitionCount | DEFAULT | Total number of partitions across all brokers in the cluster.
GlobalTopicCount | DEFAULT | Total number of topics across all brokers in the cluster.
KafkaAppLogsDiskUsed | DEFAULT | The percentage of disk space used for application logs.
KafkaDataLogsDiskUsed | DEFAULT | The percentage of disk space used for data logs.
RootDiskUsed | DEFAULT | The percentage of the root disk used by the broker.
PartitionCount | PER_BROKER | The number of partitions for the broker.
LeaderCount | PER_BROKER | The number of leader replicas.
UnderMinIsrPartitionCount | PER_BROKER | The number of under-minIsr partitions for the broker.
UnderReplicatedPartitions | PER_BROKER | The number of under-replicated partitions for the broker.
FetchConsumerTotalTimeMsMean | PER_BROKER | The mean total time in milliseconds that consumers spend fetching data from the broker.
ProduceTotalTimeMsMean | PER_BROKER | The mean produce time in milliseconds.
How about monitoring Kinesis Data Streams?
[Diagram] A consumer application calls GetRecords() against a shard. The key question: how long does a record stay in a shard before it is read?
Metrics to Monitor: Kinesis Data Streams

Metric | Description
GetRecords.IteratorAgeMilliseconds | Age of the last record in all GetRecords calls
ReadProvisionedThroughputExceeded | Number of GetRecords calls throttled
WriteProvisionedThroughputExceeded | Number of PutRecord(s) calls throttled
PutRecord.Success, PutRecords.Success | Number of successful PutRecord(s) operations
GetRecords.Success | Number of successful GetRecords operations
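GetRecords.IteratorAgeMilliseconds is the one to alarm on: if the iterator age approaches the stream’s retention period, the consumer is falling so far behind that unread records will expire. A small helper expressing that check (the 24-hour default retention matches Kinesis; the 50% warning threshold is an assumption you should tune):

```python
DEFAULT_RETENTION_MS = 24 * 60 * 60 * 1000  # Kinesis default retention: 24 hours

def consumer_lag_status(iterator_age_ms: int,
                        retention_ms: int = DEFAULT_RETENTION_MS,
                        warn_ratio: float = 0.5) -> str:
    """Classify consumer lag relative to the stream's retention period."""
    ratio = iterator_age_ms / retention_ms
    if ratio >= 1.0:
        return "DATA_LOSS"       # records are expiring before being read
    if ratio >= warn_ratio:
        return "FALLING_BEHIND"  # alarm well before loss actually happens
    return "OK"
```

In practice you would feed this from a CloudWatch alarm on the metric rather than polling it yourself.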
Choosing Good Metrics
Too much information can be just as useless as too little
How to Ingest Streaming Data?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Stream Ingestion
• AWS SDKs: publish directly from application code via APIs
• AWS Mobile SDK
• Kinesis Agent: monitors log files and forwards lines as messages to Kinesis Data Streams
• Kinesis Producer Library (KPL): background process aggregates and batches messages
• 3rd-party and open source: Kafka Connect (kinesis-kafka-connector), fluentd (aws-fluent-plugin-kinesis), Log4J Appender (kinesis-log4j-appender), and more …
[Diagram: the ingestion tools above write into Amazon Kinesis Data Streams as the stream storage]
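The KPL’s core trick is that a background process aggregates and batches messages before making a single PutRecords call. A toy sketch of the batching half in pure Python (the `send` callback stands in for a real PutRecords wrapper; the real KPL also flushes on a timer and aggregates records within a batch):

```python
class RecordBatcher:
    """Buffer records and flush them in batches, KPL-style."""

    def __init__(self, send, max_batch_size=500):
        self.send = send              # callback, e.g. a PutRecords wrapper
        self.max_batch_size = max_batch_size
        self.buffer = []

    def put(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []

# Usage: 7 records become 3 API calls instead of 7.
batches = []
batcher = RecordBatcher(batches.append, max_batch_size=3)
for i in range(7):
    batcher.put({"seq": i})
batcher.flush()  # flush the final partial batch
```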
How to Process Streaming Data?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Stream Delivery
[Diagram] Kinesis Data Firehose delivers streaming data to destinations such as S3, Redshift, Elasticsearch, and Kinesis Data Analytics.
Sources:
• Kinesis Agent
• CloudWatch Logs
• CloudWatch Events
• AWS IoT
• Direct PUT using APIs
• Kinesis Data Streams
• MSK (Kafka) using Kafka Connect
Kinesis Data Firehose: Filter, Enrich, Convert
[Diagram] Apache log lines from the data source:
[Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
[Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]
are passed by Kinesis Data Firehose to a Lambda function, which filters, enriches (e.g., a geo-ip lookup), and converts them to JSON before delivery to the data sink:
{
  "recordId": "1",
  "result": "Ok",
  "data": {
    "date": "2017/10/11 14:32:52",
    "status": "error",
    "source": "192.34.86.178",
    "city": "Boston",
    "state": "MA"
  }
},
{
  "recordId": "2",
  "result": "Dropped"
}
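The Lambda function on this slide follows Firehose’s data-transformation contract: it receives base64-encoded records and must return each recordId with a result of Ok, Dropped, or ProcessingFailed. A minimal sketch (the log-parsing regex and the choice to drop non-error lines are illustrative; the geo-ip enrichment step is omitted):

```python
import base64
import json
import re

# Parses lines like: [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
LOG_RE = re.compile(r"^\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<source>[\d.]+)\]")

def handler(event, context):
    out = []
    for rec in event["records"]:
        line = base64.b64decode(rec["data"]).decode("utf-8")
        m = LOG_RE.match(line)
        if m and m.group("status") == "error":
            doc = m.groupdict()  # enrichment (e.g. geo-ip city/state) would go here
            out.append({
                "recordId": rec["recordId"],
                "result": "Ok",
                "data": base64.b64encode(json.dumps(doc).encode()).decode(),
            })
        else:
            # Filter: drop lines that are not error records
            out.append({"recordId": rec["recordId"], "result": "Dropped"})
    return {"records": out}
```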
Pre-built Data Transformation Blueprints

Blueprint | Description
General Processing | For custom transformation logic
Apache Log to JSON | Parses and converts Apache log lines to JSON objects using predefined JSON field names
Apache Log to CSV | Parses and converts Apache log lines to CSV format
Syslog to JSON | Parses and converts Syslog lines to JSON objects using predefined JSON field names
Syslog to CSV | Parses and converts Syslog lines to CSV format
Pre-built Data Conversion
[Diagram] Data Source → Kinesis Data Firehose (JSON data; schema from the AWS Glue Data Catalog) → convert to columnar format → Amazon S3 (failed records go to a /failed prefix)
• Converts the format of your input data from JSON to a columnar data format, Apache Parquet or Apache ORC, before storing the data in Amazon S3
• Works in conjunction with the transform feature to convert other formats to JSON before the data conversion
Failure and Error Handling
• S3 destination
  • Pause and retry for up to 24 hours (maximum data retention period)
  • If data delivery fails for more than 24 hours, your data is lost
• Redshift destination
  • Configurable retry duration (0-2 hours)
  • After retries, skip and load error manifest files to S3’s errors/ folder
• Elasticsearch destination
  • Configurable retry duration (0-2 hours)
  • After retries, skip and load failed records to S3’s elasticsearch_failed/ folder
Stream Process
• Transform: filter, enrich, convert
• Aggregation: windowed queries, Top-K contributors
• Join: stream-stream join, stream-(external) table join
Services: AWS Lambda, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Dive into Stream Process Services

AWS Lambda
• Serverless functions
• Event-based, stateless processing
• Continuous and simple scaling mechanism
[Diagram] Each event (1), (2), (3) invokes its own Lambda instance (1), (2), (3)

Amazon Kinesis Data Analytics (serverless), AWS Glue (serverless), Amazon EMR (fully managed)
Architecture: Master-Worker
[Diagram] A Master coordinates Workers (1), (2), (3); input partitions part-01, part-02, part-03 are assigned across the workers, which produce output partitions part-01, part-02, part-03

Streaming Programming Guide
Treat Streams as Unbounded Tables
“It's raining cats and dogs!”
→ ["It's", "raining", "cats", "and", "dogs!"]
→ [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
→ running word counts:
It's 1
raining 1
cats 1
and 1
dogs! 1
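Treating the stream as an unbounded table means each new micro-batch updates a continuously growing result table instead of recomputing from scratch. A pure-Python sketch of that incremental update (the same idea Spark Structured Streaming implements under the hood):

```python
from collections import Counter

word_counts = Counter()  # the "unbounded table" of results

def process_batch(lines):
    """Fold one micro-batch of input lines into the running counts."""
    for line in lines:
        word_counts.update(line.split())
    return dict(word_counts)

# Two micro-batches arrive; the result table grows across them.
process_batch(["It's raining cats and dogs!"])
process_batch(["raining again"])
```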
[Code walkthrough: set up the session, read the stream, apply the streaming ETL, and start it running]
What about (Stream) SQL?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream SQL Process → Data Sink]
“It's raining cats and dogs!” → [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] → word counts (It's 1, raining 1, cats 1, and 1, dogs! 1)
Kinesis Data Analytics (SQL)
• STREAM (in-application): a continuously updated entity that you can SELECT from and INSERT into, like a TABLE
• PUMP: an entity used to continuously 'SELECT ... FROM' a source STREAM and INSERT SQL results into an output STREAM
• Create an output stream, which can be used to send results to a destination
[Diagram] Source → SOURCE STREAM → INSERT & SELECT (PUMP) → DESTINATION STREAM → Destination
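A PUMP is just a continuous query that reads from one in-application stream and inserts into another. Conceptually (this is plain Python standing in for KDA’s SQL, not its actual syntax), with a simple predicate filter as the query:

```python
def pump(source, predicate, transform):
    """Continuously SELECT matching rows FROM source and INSERT into the output."""
    for row in source:            # in KDA the source stream is unbounded
        if predicate(row):
            yield transform(row)

# SOURCE_SQL_STREAM_001 -> DESTINATION_SQL_STREAM, keeping only error rows
source_stream = iter([
    {"status": "error", "source": "192.34.86.178"},
    {"status": "info", "source": "127.0.0.1"},
])
destination_stream = list(pump(
    source_stream,
    predicate=lambda r: r["status"] == "error",
    transform=lambda r: r["source"],
))
```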
SQL vs Java

DEMO
https://aws.amazon.com/ko/blogs/aws/new-amazon-kinesis-data-analytics-for-java/
[Diagram] (Java path) Amazon Kinesis Data Streams → Amazon Kinesis Data Analytics (Java) → Amazon Kinesis Data Firehose → Amazon S3
(SQL path) Amazon Kinesis Data Streams → Amazon Kinesis Data Analytics (SQL)
DEMO: Word Count
“It's raining cats and dogs!” → (1) [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] → (2) word counts: It's 1, raining 1, cats 1, and 1, dogs! 1
Filter, Enrich, Convert Streaming Data
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Revisit Example: Filter, Enrich, Convert
[Diagram] The same pipeline as before: Apache log lines → Kinesis Data Firehose → Lambda function (filter, enrich via geo-ip, convert to JSON, emitting recordId/result pairs) → data sink
Stream Process: Filter, Enrich, Convert
[Diagram] Apache log lines → Amazon Kinesis Data Streams → Lambda function, Amazon Kinesis Data Analytics, Amazon EMR, or AWS Glue (filter, enrich via geo-ip, convert to JSON) → data sink
Stream Process: Filter, Enrich, Convert
[Diagram] Apache log lines → Amazon MSK → Amazon Kinesis Data Analytics (Java), Amazon EMR, or AWS Glue (filter, enrich via geo-ip, convert to JSON) → data sink
Kinesis Data Analytics (SQL): Preprocessing Data
https://aws.amazon.com/ko/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
Integration of Stream Process and Stream Storage

Stream Storage | AWS Lambda | Kinesis Data Analytics (SQL) | Kinesis Data Analytics (Java) | Glue | EMR
Kinesis Data Firehose | O | O | X | X | X
Kinesis Data Streams | O | O | O | O | O
Managed Streaming for Kafka (MSK) | X | X | O | O | O

[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Aggregate Streaming Data
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Stream Process: Aggregation
• Aggregations (count, sum, min, ...) take granular real-time data and turn it into insights
• Data is continuously processed, so you need to tell the application when you want results
• Windowed queries:
  a. Sliding windows (with overlap)
  b. Tumbling windows (no overlap)
  c. Custom windows
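The difference between the window types is easiest to see in code. A sketch over timestamped events (event times in seconds; counts per window; window boundaries are the usual floor-to-boundary convention):

```python
from collections import defaultdict

def tumbling_counts(events, size):
    """Non-overlapping windows: each event belongs to exactly one window."""
    counts = defaultdict(int)
    for t in events:
        counts[(t // size) * size] += 1
    return dict(counts)

def sliding_counts(events, size, slide):
    """Overlapping windows: an event can fall into several windows."""
    counts = defaultdict(int)
    for t in events:
        # every window [start, start+size) whose start is on a slide boundary
        first = ((t - size) // slide + 1) * slide
        for start in range(max(0, first), t + 1, slide):
            if start <= t < start + size:
                counts[start] += 1
    return dict(counts)

events = [1, 2, 12, 13, 14, 25]
tumbling = tumbling_counts(events, size=10)          # windows start at 0, 10, 20
sliding = sliding_counts(events, size=10, slide=5)   # windows overlap by 5s
```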
Join Streaming Data
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
Imagine it! How to build?
Stream Process: Join
[Diagram] Three approaches:
(a) Stream-Stream Join: two data sources, each with its own stream storage, joined inside a single stream process
(b) Stream Join by Partition Key: both data sources write to one stream storage, partitioned by a shared key, and the stream process joins within each shard
(c) Stream Join by Hash Table: one data source streams through stream storage into the stream process, which looks the other side up in key-value storage
Why is Stream-Stream Join so difficult?
[Diagram] Two sources write to two streams that a single stream process must join across time t0, t1, t2, ... tN, buffering each side for some interval ∆t
• Timing: matching records may arrive on the two streams at different times
• Skewed data: one stream may run far ahead of the other
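One way to handle the timing problem is to buffer each side for ∆t and match keys as records arrive. A toy sketch (in-memory buffers; records are (time, key, value) tuples; the fixed ∆t and simple eviction policy are assumptions a real engine makes configurable):

```python
def stream_stream_join(left, right, delta_t):
    """Join two timestamped streams on key, matching records within delta_t."""
    left_buf, right_buf, joined = [], [], []
    # Merge both streams in event-time order, tagging each record with its side.
    merged = sorted([(r, "L") for r in left] + [(r, "R") for r in right],
                    key=lambda x: x[0][0])
    for (t, key, value), side in merged:
        own, other = (left_buf, right_buf) if side == "L" else (right_buf, left_buf)
        # Evict records older than delta_t - they can no longer match.
        other[:] = [r for r in other if t - r[0] <= delta_t]
        for _, okey, ovalue in other:
            if okey == key:
                joined.append((key, value, ovalue) if side == "L"
                              else (key, ovalue, value))
        own.append((t, key, value))
    return joined

clicks = [(1, "u1", "click-a"), (9, "u2", "click-b")]
buys = [(3, "u1", "buy-a"), (30, "u2", "buy-b")]
matches = stream_stream_join(clicks, buys, delta_t=5)
```

Note how u2's purchase at t=30 never matches the click at t=9: with skewed or late data, a bounded ∆t trades completeness for bounded state.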
How about Stream Join by Partition Key?
[Diagram] Both data sources write records with the same partition key into shared stream storage, so matching records land in the same shard (e.g., shard-1: t1 t2 t3 t5; shard-2: t1 t2 t3 t5; shard-3: t1 t1 t2 t3) and the stream process can join within each shard
• Caveat: each shard will be filled with records coming from fast data producers
Lastly, how about Stream Join by Hash Table?
[Diagram] One data source streams through stream storage into the stream process; the other side is materialized into key-value storage (a hash table) that the stream process looks up per record
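The hash-table approach materializes the slowly changing side (e.g., reference data) into key-value storage and enriches each streaming record with a lookup. A sketch using a plain dict where a real pipeline might use DynamoDB, Redis, or Kinesis Data Analytics’ S3 reference data:

```python
# Reference data, e.g. loaded from a CSV in S3 into key-value storage
companies = {"AMZN": "Amazon", "CRM": "SomeCompanyC"}

def enrich(records, table):
    """Stream-table join: look up each record's key in the hash table."""
    for rec in records:
        yield {**rec, "company": table.get(rec["ticker"], "UNKNOWN")}

stream = [{"ticker": "AMZN", "price": 53.63}, {"ticker": "XYZ", "price": 1.0}]
enriched = list(enrich(stream, companies))
```

Because the lookup side is a point query, this avoids the buffering and timing issues of a true stream-stream join, at the cost of the table being only as fresh as its last refresh.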
DEMO: Filter, Aggregate, Join
Sample records on the stream:
{"TICKER_SYMBOL": "CVB", "SECTOR": "TECHNOLOGY", "CHANGE": 0.81, "PRICE": 53.63}
{"TICKER_SYMBOL": "ABC", "SECTOR": "RETAIL", "CHANGE": -1.14, "PRICE": 23.64}
{"TICKER_SYMBOL": "JKL", "SECTOR": "TECHNOLOGY", "CHANGE": 0.22, "PRICE": 15.32}
[Diagram] Amazon Kinesis Data Streams → Amazon Kinesis Data Analytics (SQL): continuous filter, aggregate function, data enrichment (join) against reference data in an S3 bucket:
Ticker,Company
AMZN,Amazon
ASD,SomeCompanyA
BAC,SomeCompanyB
CRM,SomeCompanyC
https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-add-reference-data.html
Comparing Stream Process Services

DevOps! Master-Worker Framework
[Diagram] Running a master-worker framework yourself raises operational questions: Is the master alive? Do workers have enough resources (CPU, memory, disk)? Checkpointing? The right instance type (C-family or R-family)? What is the learning curve (SQL, Python, Scala, Java)?
EMR vs Glue vs Kinesis Data Analytics
[Chart: operational excellence vs degree of freedom (≈ complexity). Kinesis Data Analytics (SQL) offers the most operational excellence with the least freedom; Kinesis Data Analytics (Java) and AWS Glue sit in between; EMR offers the most freedom at the cost of the most operational burden.]
Comparing stream processing services

AWS Lambda - simple programming interface and scaling:
• Serverless functions
• Six languages (Java, Python, Golang, Node.js, Ruby, C#)
• Event-based, stateless processing
• Continuous and simple scaling mechanism

Amazon Kinesis Data Analytics - easy and powerful stream processing:
• Serverless applications
• Supports SQL and Java (Apache Flink)
• Stateful processing with automatic backups
• Stream operators make building apps easy

AWS Glue - simple, flexible, and cost-effective ETL & Data Catalog:
• Serverless applications
• Can use the transforms native to Apache Spark Structured Streaming
• Automatically discovers new data and extracts schema definitions
• Automatically generates the ETL code

Amazon EMR - flexibility and choice for your needs:
• Choose your instances
• Use your favorite open-source framework
• Fine-grained control over the cluster, debugging tools, and more
• Deep open-source tool integrations with AWS
Case Studies

Example Usage Pattern 1: Web Analytics and Leaderboards
[Diagram] Lightweight JS client code (via Amazon Cognito) or a web server on Amazon EC2 ingests web app data into Amazon Kinesis Data Streams; Amazon Kinesis Data Analytics computes the top 10 users; a Lambda function persists results to Amazon DynamoDB to feed live apps
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
Example Usage Pattern 2: Monitoring IoT Devices
[Diagram] Ingest sensor data, convert JSON to Parquet, and store all data points in an S3 data lake
https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/
Example Usage Pattern 3: Analyzing AWS CloudTrail Event Logs
[Diagram] AWS CloudTrail → CloudWatch Events trigger → Kinesis Data Firehose ingests raw log data → Kinesis Data Analytics computes operational metrics → Lambda function delivers to a DynamoDB table feeding a Chart.JS dashboard, and to an S3 bucket for raw-data archival
https://aws.amazon.com/solutions/implementations/real-time-insights-account-activity/
Takeaways

From Batch to Real-time: Lambda Architecture
[Diagram] Data Source → Stream Ingestion → Stream Storage, then Speed Layer (Stream Process → Real-time View) and Batch Layer (Stream Delivery → Raw Data Storage → Batch Process → Batch View), merged by the Service Layer (Query & Merge Results) for the Consumer
Key Components of Real-time Analytics
[Diagram] Data Source → Stream Ingestion (AWS SDKs, Kinesis Agent, KPL, Kafka Connect) → Stream Storage (Kinesis Data Streams, Kinesis Data Firehose, Managed Streaming for Kafka) → Stream Process (AWS Lambda, Kinesis Data Analytics, Glue, EMR) → Data Sink
Real-Time Applications: aggregation, Top-K contributors, anomaly detection
Streaming ETL: filter, enrich, convert, join
Key Takeaways
• Build decoupled systems
  • Data → Store → Process → Store → Analyze → Answers
  • Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
• Follow the principle of “extract data once and reuse multiple times” to power new customer experiences
• Use the right tool for the job
  • Know the AWS services’ soft and hard limits
• Leverage managed and serverless services (DevOps!)
  • Scalable/elastic, available, reliable, secure, no/low admin
Where To Go Next?
• AWS Analytics Immersion Day - Build BI System from Scratch
• Workshop - https://tinyurl.com/yapgwv77
• Slides - https://tinyurl.com/ybxkb74b
• Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1, 2
• Part1 - https://tinyurl.com/y8vo8q7o
• Part2 - https://tinyurl.com/ycbv7wel
• Streaming Analytics Workshop – Kinesis Data Analytics for Java (Flink)
https://streaming-analytics.labgui.de/
• Amazon MSK Labs
https://amazonmsk-labs.workshop.aws/
• Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming
https://tinyurl.com/y7hklyff
• AWS Glue Streaming ETL - Scala Script Example
https://tinyurl.com/y79x6jda
Appendix
• Amazon Managed Streaming for Apache Kafka: Best Practices
https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html
• Optimizing Your Apache Kafka® Deployment
https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
• Monitoring Kafka performance metrics
https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
Realtime Analytics on AWS

  • 1.
    Real-time Analytics on Sungmin,Kim Solutions Architect, AWS
  • 2.
    Agenda • Why Real-timeData streaming and Analytics? • How to Build? • Where to Store streaming data? • How to Ingest streaming data? • How to Process streaming data? • Delivery Streaming Data • Dive into Stream Process Framework • Transform, Aggregate, Join Streaming Data • Case Studies • Key Takeaways
  • 3.
    Why Real-time Datastreaming and Analytics?
  • 4.
    Data The world’s most valuableresource is no longer oil, but data.* *Copyright: David Parkins , The Economist, 2017 “ ”
  • 5.
    Data Loses ValueOver Time * Source: Mike Gualtieri, Forrester, Perishable insights Real time Seconds Minutes Hours Days Months Valueofdatatodecision-making Preventive/predictive Actionable Reactive Historical Time-critical decisions Traditional “batch” business intelligence
  • 6.
    To create Value,derive insights in Real-time
  • 7.
    Batch vs Real-time BatchDifference Real-time Arbitrarily, or Periodically Continuity Constant Store → Process (Hadoop MapReduce, Hive, Pig, Spark) Method of analysis Process → Store (Spark Streaming, Flink, Apache Storm) Small - Huge (KB~TB) Data size per a unit Small (B~KB) Low - High (minutes to hours) Query Latency Low (milliseconds to minutes) Low - High (hourly/daily/monthly) Request Rate Very High - High (in seconds, minutes) High - Very high Durability Low - High ¢~$ (Amazon S3, Glacier) Cost/GB $$~$ (Redis, Memcached)
  • 8.
    From Batch toReal-time: Lambda Architecture Data Source Stream Storage Speed Layer Batch Layer Batch Process Batch View Real- time View Consumer Query & Merge Results Service Layer Stream Ingestion Raw Data Storage Streaming Data Stream Delivery Stream Process
  • 9.
    Lambda Architecture Streaming Data Batch View StreamProcess Real-time View Query Query Batch View Real-time View Raw Data Batch Process Batch Layer Serving Layer Speed Layer
  • 10.
    Key Components ofReal-time Analytics Data Source Stream Storage Stream Process Stream Ingestion Data Sink Devices and/or applications that produce real-time data at high velocity Data from tens of thousands of data sources can be written to a single stream Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time Records are read in the order they are produced, enabling real-time analytics or streaming ETL Data lake (most common) Database (least common)
  • 11.
    Where to StoreStreaming Data? Data Source Stream Storage Stream Process Stream Ingestion Data Sink
  • 12.
  • 13.
    Hash Function Consumer Consumer Consumer Consumer Group PK PK PK PK = nextconsumer offset oldest datanewest data Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka Producers shard/partition-1 shard/partition-2 5 4 3 2 1 0 3 2 1 0 4 3 2 1 0 4 2 0 shard/partition-3
  • 14.
    Why is StreamStorage? • Decouple producers & consumers • Persistent buffer • Collect multiple streams • Preserve client ordering • Parallel consumption • Streaming MapReduce
  • 15.
    • Decouple producers& consumers • Persistent buffer • Collect multiple streams • No client ordering (standard) • FIFO queue preserves client ordering • No streaming MapReduce • No parallel consumption • Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions) Consumers 4 3 2 1 12344 3 2 1 1234 2134 13342 Standard FIFO Producers Amazon SQS Queue What about SQS? Publisher Amazon SNS Topic AWS Lambda function Amazon SQS queue Queue Subscriber
  • 16.
    Topic Amazon Kinesis Data Streams AmazonManaged Streaming for Kafka
  • 17.
    Amazon Kinesis Data Streams AmazonManaged Streaming for Kafka • Operational Considerations • Number of clusters? • Number of brokers per cluster? • Number of topics per broker? • Number of partitions per topic? • Only increase number of partitions; can’t decrease • Integration with a few of AWS Services such as Kinesis Data Analytics for Java • Operational Considerations • Number of Kinesis Data Streams? • Number of shards per stream? • Increase/Decrease number of shards • Fully Integration with AWS Services such as Lambda function, Kinesis Data Analytics, etc
  • 18.
    RequestQueue - Length - WaitTime ResponseQueue -Length - WaitTime Network - Packet Drop? Produce/Consume Rate Unbalance Who is Leader? Disk Full? Too many topics? Metrics to Monitor: MSK (Kafka)
  • 19.
    Metrics to Monitor:MSK (Kafka) Metric Level Description ActiveControllerCount DEFAULT Only one controller per cluster should be active at any given time. OfflinePartitionsCount DEFAULT Total number of partitions that are offline in the cluster. GlobalPartitionCount DEFAULT Total number of partitions across all brokers in the cluster. GlobalTopicCount DEFAULT Total number of topics across all brokers in the cluster. KafkaAppLogsDiskUsed DEFAULT The percentage of disk space used for application logs. KafkaDataLogsDiskUsed DEFAULT The percentage of disk space used for data logs. RootDiskUsed DEFAULT The percentage of the root disk used by the broker. PartitionCount PER_BROKER The number of partitions for the broker. LeaderCount PER_BROKER The number of leader replicas. UnderMinIsrPartitionCount PER_BROKER The number of under minIsr partitions for the broker. UnderReplicatedPartitions PER_BROKER The number of under-replicated partitions for the broker. FetchConsumerTotalTimeMsMean PER_BROKER The mean total time in milliseconds that consumers spend on fetching data from the broker. ProduceTotalTimeMsMean PER_BROKER The mean produce time in milliseconds.
  • 20.
    How about monitoringKinesis Data Streams? Consumer Application GetRecords() Data How long time does a record stay in a shard?
  • 21.
    Metrics to Monitor:Kinesis Data Streams Metric Description GetRecords.IteratorAgeMilliseconds Age of the last record in all GetRecords ReadProvisionedThroughputExceeded Number of GetRecords calls throttled WriteProvisionedThroughputExceeded Number of PutRecord(s) calls throttled PutRecord.Success, PutRecords.Success Number of successful PutRecord(s) operations GetRecords.Success Number of successful GetRecords operations
  • 22.
    Choosing Good Metrics Toomuch information can be just as useless as too little
  • 23.
    How to IngestStreaming Data? Data Source Stream Storage Stream Process Stream Ingestion Data Sink
  • 24.
    Stream Ingestion • AWSSDKs • Publish directly from application code via APIs • AWS Mobile SDK • Kinesis Agent • Monitors log files and forwards lines as messages to Kinesis Data Streams • Kinesis Producer Library (KPL) • Background process aggregates and batches messages • 3rd party and open source • Kafka Connect (kinesis-kafka-connector) • fluentd (aws-fluent-plugin-kinesis) • Log4J Appender (kinesis-log4j-appender) • and more … Data Source Stream Storage Stream Process Stream Ingestion Data Sink Amazon Kinesis Data Streams
  • 25.
    How to ProcessStreaming Data? Data Source Stream Storage Stream Process Stream Ingestion Data Sink
  • 26.
    Elasticsearch Redshift Stream Delivery Data Source Stream Storage Stream Process Stream Ingestion Data Sink Stream Delivery Kinesis Data Firehose •Kinesis Agent • CloudWatch Logs • CloudWatch Events • AWS IoT • Direct PUT using APIs • Kinesis Data Streams • MSK(Kafka) using Kafka Connect Kinesis Data Analytics S3
  • 27.
    Kinesis Firehose: Filter,Enrich, Convert Data Source apache log apache log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } json Lambda function Kinesis Data Firehose
  • 28.
    Pre-built Data TransformationBlueprints Blueprint Description General Processing For custom transformation logic Apache Log to JSON Parses and converts Apache log lines to JSON objects using predefined JSON field names Apache Log to CSV Parses and converts Apache log lines to CSV format Syslog to JSON Parses and converts Syslog lines to JSON objects using predefined JSON field names Syslog to CSV Parses and converts Syslog lines to CSV format
  • 29.
    Pre-built Data Conversion Data Source Kinesis DataFirehose JSON Data schema AWS Glue Data Catalog Amazon S3 • Convert the format of your input data from JSON to columnar data format Apache Parquet or Apache ORC before storing the data in Amazon S3 • Works in conjunction to the transform features to convert other format to JSON before the data conversion convert to columnar format /failed
  • 30.
    Failure and ErrorHandling • S3 Destination • Pause and retry for up to 24 hours (maximum data retention period) • If data delivery fails for more than 24 hours, your data is lost. • Redshift Destination • Configurable retry duration (0-2 hours) • After retry, skip and load error manifest files to S3’s errors/ folder • Elasticsearch Destination • Configurable retry duration (0-2 hours) • After retry, skip and load failed records to S3’s elasticsearch_failed/ folder
  • 31.
    Stream Process • Transform •Filter, Enrich, Convert • Aggregation • Windows Queries • Top-K Contributor • Join • Stream-Stream Join • Stream-(External) Table Join Data Source Stream Storage Stream Process Stream Ingestion Data Sink AWS Lambda Amazon Kinesis Data Analytics AWS Glue Amazon EMR
  • 32.
    Dive into StreamProcess Services
  • 33.
    AWS Lambda • Serverlessfunctions • Event-based, stateless processing • Continuous and simple scaling mechanism event (3) event (2) event (1) Lambda (1) Lambda (2) Lambda (3)
  • 34.
    Amazon Kinesis Data Analytics AWSGlue Amazon EMR Serverless ServerlessFully Managed
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
    Treat Streams asUnbounded Tables
  • 40.
    “It's raining catsand dogs!” ["It's", "raining", "cats", "and", "dogs!"] [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1
  • 42.
    “It's raining catsand dogs!” ["It's", "raining", "cats", "and", "dogs!"] [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1
  • 44.
  • 45.
    What about (Stream)SQL? Data Source Stream Storage Stream SQL Process Stream Ingestion Data Sink [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] “It's raining cats and dogs!” It’s 1 raining 1 cats 1 and 1 dogs! 1
  • 46.
    Kinesis Data Analytics(SQL) • STREAM (in-application): a continuously updated entity that you can SELECT from and INSERT into like a TABLE • PUMP: an entity used to continuously 'SELECT ... FROM' a source STREAM, and INSERT SQL results into an output STREAM • Create output stream, which can be used to send to a destination SOURCE STREAM INSERT & SELECT (PUMP) DESTIN. STREAM Destination Source [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
  • 47.
  • 48.
  • 49.
    https://aws.amazon.com/ko/blogs/aws/new-amazon-kinesis-data-analytics-for-java/ Amazon Kinesis Data Streams AmazonKinesis Data Firehose Amazon S3Amazon Kinesis Data Analytics (Java) Amazon Kinesis Data Streams Amazon Kinesis Data Streams Amazon Kinesis Data Analytics (SQL) DEMO: Word Count “It's raining cats and dogs!” [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1 [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] 1 2
  • 50.
    Filter, Enrich, Convert Streaming Data: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
  • 51.
    Revisit Example: Filter, Enrich, Convert
    Data Source → Kinesis Data Firehose (+ Lambda function for geo-ip lookup) → Data Sink
    Input (apache log):
    [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
    [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]
    Enriched record (json): { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }
    Transformation output (json):
    { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } },
    { "recordId": "2", "result": "Dropped" }
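A minimal sketch of such a transformation Lambda in Python, following the Kinesis Data Firehose record-transformation contract (each record carries a `recordId` and base64 `data`, and each result must be marked `Ok`, `Dropped`, or `ProcessingFailed`). The `GEO_DB` dict is a hypothetical stand-in for a real geo-ip service, and the date is passed through without reformatting:

```python
import base64
import json
import re

# Hypothetical stand-in for a geo-ip lookup service
GEO_DB = {"192.34.86.178": {"city": "Boston", "state": "MA"}}

LOG_RE = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<ip>[\d.]+)\]"
)

def handler(event, context):
    """Firehose transformation Lambda: parse apache-style log lines,
    drop non-error records, enrich errors with geo-ip data."""
    output = []
    for rec in event["records"]:
        line = base64.b64decode(rec["data"]).decode()
        m = LOG_RE.match(line)
        if m is None or m.group("status") != "error":
            # Filtered out: Firehose discards records marked Dropped
            output.append({"recordId": rec["recordId"], "result": "Dropped"})
            continue
        doc = {"date": m.group("date"), "status": m.group("status"),
               "source": m.group("ip")}
        doc.update(GEO_DB.get(m.group("ip"), {}))   # enrich step
        output.append({
            "recordId": rec["recordId"],
            "result": "Ok",
            "data": base64.b64encode(json.dumps(doc).encode()).decode(),
        })
    return {"records": output}

# The slide's two sample log lines, wrapped as Firehose records
lines = [
    "[Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]",
    "[Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]",
]
event = {"records": [
    {"recordId": str(i + 1), "data": base64.b64encode(l.encode()).decode()}
    for i, l in enumerate(lines)
]}
response = handler(event, None)
```

As on the slide, the error line comes back as record 1 with `result: Ok` and an enriched json payload, while the info line is record 2 with `result: Dropped`.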
  • 52.
    Stream Process: Filter, Enrich, Convert
    Data Source → Amazon Kinesis Data Streams → (Lambda function | Amazon Kinesis Data Analytics | Amazon EMR | AWS Glue) → Data Sink
    Same example: apache log lines are filtered, geo-ip enriched, and converted to json records wrapped with "recordId" and "result" ("Ok" or "Dropped")
  • 53.
    Stream Process: Filter, Enrich, Convert
    Data Source → Amazon MSK → (Amazon Kinesis Data Analytics (Java) | Amazon EMR | AWS Glue) → Data Sink
    Same example: apache log lines are filtered, geo-ip enriched, and converted to json records
  • 54.
    Kinesis Data Analytics (SQL): Preprocessing Data
    https://aws.amazon.com/ko/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
  • 55.
    Integration of Stream Process and Stream Storage
    Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink

    Stream Storage \ Stream Process   | Lambda | KDA (SQL) | KDA (Java) | Glue | EMR
    Kinesis Data Firehose             |   O    |     O     |     X      |  X   |  X
    Kinesis Data Streams              |   O    |     O     |     O      |  O   |  O
    Managed Streaming for Kafka (MSK) |   X    |     X     |     O      |  O   |  O
    (Lambda = AWS Lambda; KDA = Kinesis Data Analytics)
  • 57.
    Stream Process: Aggregation
    • Aggregations (count, sum, min, ...) take granular real-time data and turn it into insights
    • Data is continuously processed, so you need to tell the application when you want results
    • Windowed Queries
      a. Sliding Windows (with overlap)
      b. Tumbling Windows (no overlap)
      c. Custom Windows
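A tumbling window can be sketched in a few lines of Python: this toy version assigns each event to exactly one fixed-size window by truncating its timestamp to the window boundary (the event times and ticker keys are made up for illustration):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Tumbling windows (no overlap): each event falls into exactly one
    window; emit one aggregate per (window_start, key)."""
    windows = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)   # align to window boundary
        windows[(window_start, key)] += 1
    return dict(windows)

events = [(0, "AMZN"), (5, "AMZN"), (12, "AMZN"), (13, "ABC")]
counts = tumbling_window_counts(events, window_seconds=10)
# AMZN lands twice in window [0, 10) and once in [10, 20)
```

A sliding window would instead assign each event to every window that overlaps its timestamp, so the same event contributes to multiple aggregates.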
  • 59.
    Imagine It! How to Build?
  • 60.
    Stream Process: Join
    (a) Stream-Stream Join: two Data Sources → two Stream Storages → one Stream Process
    (b) Stream-Join by Partition Key: two Data Sources → one shared Stream Storage → Stream Process
    (c) Stream-Join by Hash Table: Data Source → Stream Storage → Stream Process, with lookups against a Key-Value Storage
  • 61.
    Why is Stream-Stream Join so difficult?
    Two Data Sources → two Stream Storages → Stream Process → Data Sink
    • Timing: matching records arrive at different times (t0, t1, t2, ..., tN), so each side must be buffered for some window ∆t before it can be joined
    • Skewed data: one stream may run far ahead of the other
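The timing problem can be made concrete with a toy sketch: each side is buffered, and only records whose event times fall within ∆t of each other can be joined, so a match on a slow or skewed stream can be missed (the names and timestamps below are illustrative):

```python
def stream_stream_join(left, right, delta_t):
    """Join two event streams on key, but only when their event times are
    within delta_t of each other; everything else must stay buffered (and
    eventually be evicted), which is what makes this pattern hard."""
    buffered = list(left)        # (ts, key, value) records awaiting a match
    joined = []
    for ts, key, value in right:
        for lts, lkey, lvalue in buffered:
            if lkey == key and abs(lts - ts) <= delta_t:
                joined.append((key, lvalue, value))
    return joined

clicks    = [(1, "u1", "click"), (2, "u2", "click")]
purchases = [(3, "u1", "purchase"), (50, "u2", "purchase")]
matches = stream_stream_join(clicks, purchases, delta_t=5)
# u2's purchase arrives 48 time units after the click, outside the
# delta_t window, so that pair is lost unless the buffer is kept longer
```

Growing ∆t catches more late matches but forces the processor to hold more state, which is exactly the trade-off real engines manage with windowing and watermarks.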
  • 62.
    How about Stream-Join by Partition Key?
    Two Data Sources write into one shared Stream Storage, partitioned by the join key, so matching records land in the same shard (shard-1, shard-2, shard-3, each holding records at t1, t2, t3, t5) → Stream Process
    • Caveat: each shard will be filled with records coming from fast data producers
  • 63.
    Lastly, how about Stream-Join by Hash Table?
    Data Source → Stream Storage → Stream Process, which joins each streaming record against a Key-Value Storage (hash table) lookup instead of buffering a second stream
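A toy sketch of this pattern in Python, with a dict standing in for the key-value store (Redis, Memcached, or DynamoDB in practice; the record fields are illustrative):

```python
# A dict stands in for the key-value store (Redis, Memcached, DynamoDB, ...)
reference = {"AMZN": "Amazon", "BAC": "SomeCompanyB"}

def enrich(stream_records, kv_store):
    """Stream-join by hash table: each streaming record does a point
    lookup on the slow-changing side instead of buffering a second
    stream, so there is no timing or skew problem to manage."""
    for rec in stream_records:
        yield {**rec, "company": kv_store.get(rec["ticker"])}

rows = list(enrich([{"ticker": "AMZN", "price": 53.63}], reference))
```

This works well when one side of the join is slow-changing reference data; the cost moves from buffering state in the processor to lookup latency against the store.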
  • 65.
    DEMO: Filter, Aggregate, Join
    Amazon Kinesis Data Streams → Amazon Kinesis Data Analytics (SQL): continuous filter, aggregate function, data enrichment (join) against reference data in a bucket with objects
    Sample records:
    {"TICKER_SYMBOL": "CVB", "SECTOR": "TECHNOLOGY", "CHANGE": 0.81, "PRICE": 53.63}
    {"TICKER_SYMBOL": "ABC", "SECTOR": "RETAIL", "CHANGE": -1.14, "PRICE": 23.64}
    {"TICKER_SYMBOL": "JKL", "SECTOR": "TECHNOLOGY", "CHANGE": 0.22, "PRICE": 15.32}
    Reference data (CSV):
    Ticker,Company
    AMZN,Amazon
    ASD,SomeCompanyA
    BAC,SomeCompanyB
    CRM,SomeCompanyC
    https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-add-reference-data.html
  • 67.
    DevOps! Master-Worker Framework
    Master coordinates Worker (1), Worker (2), Worker (3), each assigned partitions part-01, part-02, part-03
    Operational questions you own when self-managing:
    • Is the Master alive?
    • Does each Worker have enough resources (CPU, memory, disk)?
    • Checkpointing?
    • Right instance type: C-family or R-family?
    • Learning curve: SQL, Python, Scala, or Java?
  • 68.
    EMR vs Glue vs Kinesis Data Analytics
    A trade-off between Operational Excellence and Degree of Freedom (≈ Complexity), spanning Kinesis Data Analytics (SQL), Kinesis Data Analytics (Java), Glue, and EMR
  • 69.
    Comparing stream processing services

    AWS Lambda: simple programming interface and scaling
    • Serverless functions
    • Six languages (Java, Python, Golang, Node.js, Ruby, C#)
    • Event-based, stateless processing
    • Continuous and simple scaling mechanism

    Amazon Kinesis Data Analytics: easy and powerful stream processing
    • Serverless applications
    • Supports SQL and Java (Apache Flink)
    • Stateful processing with automatic backups
    • Stream operators make building apps easy

    AWS Glue: simple, flexible, and cost-effective ETL & Data Catalog
    • Serverless applications
    • Can use the transforms native to Apache Spark Structured Streaming
    • Automatically discovers new data and extracts schema definitions
    • Automatically generates the ETL code

    Amazon EMR: flexibility and choice for your needs
    • Choose your instances
    • Use your favorite open-source framework
    • Fine-grained control over cluster, debugging tools, and more
    • Deep open-source tool integrations with AWS
  • 71.
    Example Usage Pattern 1: Web Analytics and Leaderboards
    Lightweight JS client code or web server on Amazon EC2 → Amazon Kinesis Data Streams (ingest web app data) → Amazon Kinesis Data Analytics (compute top 10 users) → Lambda function → Amazon DynamoDB (persist to feed live apps), with Amazon Cognito providing client credentials
    https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
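The "compute top 10 users" step can be sketched as a per-window count plus a top-K selection. This is a toy, in-memory version; the event shapes and user names are illustrative:

```python
import heapq
from collections import Counter

def top_k_users(page_view_events, k=10):
    """Leaderboard step of the pattern: count events per user in the
    current window, then keep only the k heaviest contributors."""
    counts = Counter(user for user, _page in page_view_events)
    # nlargest sorts by count, descending, and keeps only k entries
    return heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])

events = [("alice", "/home"), ("bob", "/home"), ("alice", "/cart")]
leaders = top_k_users(events, k=2)   # [("alice", 2), ("bob", 1)]
```

In the AWS pattern this computation runs inside Kinesis Data Analytics over a time window, and the resulting leaderboard rows are what the Lambda function persists to DynamoDB.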
  • 72.
    Example Usage Pattern 2: Monitoring IoT Devices
    Ingest sensor data → convert json to parquet → store all data points in an S3 data lake
    https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/
  • 73.
    Example Usage Pattern 3: Analyzing AWS CloudTrail Event Logs
    AWS CloudTrail → CloudWatch Events trigger → Kinesis Data Analytics + Lambda function: ingest raw log data, compute operational metrics
    Kinesis Data Firehose delivers to an S3 bucket for raw data; a DynamoDB table feeds a Chart.JS dashboard (deliver to real-time dashboards and archival)
    https://aws.amazon.com/solutions/implementations/real-time-insights-account-activity/
  • 75.
    From Batch to Real-time: Lambda Architecture
    Data Source → Stream Ingestion → Stream Storage, then two paths:
    • Speed Layer: Stream Process → Real-time View
    • Batch Layer: Stream Delivery → Raw Data Storage → Batch Process → Batch View
    Service Layer: Consumer queries and merges results from both views
  • 76.
    Key Components of Real-time Analytics
    Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
    • Stream Ingestion: Kafka Connect, KPL, Kinesis Agent, AWS SDKs
    • Stream Storage: Kinesis Data Firehose, Kinesis Data Streams, Managed Streaming for Kafka
    • Stream Process: AWS Lambda, Kinesis Data Analytics, Glue, EMR
    • Real-Time Applications: Aggregation, Top-K Contributor, Anomaly Detection
    • Streaming ETL: Filter, Enrich, Convert, Join
  • 77.
    Key Takeaways
    • Build decoupled systems
      • Data → Store → Process → Store → Analyze → Answers
      • Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
    • Follow the principle of "extract data once and reuse multiple times" to power new customer experiences
    • Use the right tool for the job
      • Know the AWS services' soft and hard limits
    • Leverage managed and serverless services (DevOps!)
      • Scalable/elastic, available, reliable, secure, no/low admin
  • 78.
    Where To Go Next?
    • AWS Analytics Immersion Day - Build BI System from Scratch
      • Workshop - https://tinyurl.com/yapgwv77
      • Slides - https://tinyurl.com/ybxkb74b
    • Writing SQL on Streaming Data with Amazon Kinesis Analytics, Parts 1 and 2
      • Part 1 - https://tinyurl.com/y8vo8q7o
      • Part 2 - https://tinyurl.com/ycbv7wel
    • Streaming Analytics Workshop - Kinesis Data Analytics for Java (Flink): https://streaming-analytics.labgui.de/
    • Amazon MSK Labs: https://amazonmsk-labs.workshop.aws/
    • Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming: https://tinyurl.com/y7hklyff
    • AWS Glue Streaming ETL - Scala Script Example: https://tinyurl.com/y79x6jda
  • 79.
    Appendix
    • Amazon Managed Streaming for Apache Kafka: Best Practices
      https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html
    • Optimizing Your Apache Kafka® Deployment
      https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
    • Monitoring Kafka performance metrics
      https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/