Organizations demand ever-faster tools to process and analyze data in real time. Apache Spark and Apache Flink have emerged as popular, open source frameworks to address these requirements. In this tech talk, we provide an overview of these technologies and the differences between them. We show how you can deploy Apache Spark and Flink on AWS to address common big data use cases such as batch and real-time data processing, interactive data science, and predictive analytics. We talk about common architectures for running these frameworks on Amazon EMR, including tips to connect to Amazon Kinesis – a platform of managed services that makes it easy to work with real-time streaming data in the AWS Cloud – and Apache Kafka, a popular open source platform for streaming data.
Learning Objectives:
• Understand common use cases and differences between Apache Spark and Apache Flink
• Explain deployment modes and best practices for running Spark and Flink on Amazon EMR
• Identify ways to connect to Kinesis and Kafka for streaming ingest
• Describe how to architect streaming jobs for durability and availability
Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks
1. Real-time Stream Processing on EMR:
Apache Flink vs Apache Spark Streaming
Keith Steward, Ph.D.
Specialist (EMR) Solution Architect
AWS
2. What we’ll cover:
1. The need for real-time stream processing, and challenges in accomplishing it
2. Flink stream processor (versus Spark Streaming):
• What are its aims?
• How does it address real-time stream processing challenges?
• How does it differ from Spark Streaming?
• Real-world Flink examples
• When to use Flink vs Spark Streaming?
3. Flink Demo: How to deploy & run a Flink stream processing architecture in AWS
3. The Need for Real-Time Stream Processing
Increasingly, data is arriving as continuous flows of events:
• cars in motion emitting GPS signals
• financial transactions
• interchange of signals between cell phone towers and people busy with their smartphones
• web traffic
• machine logs
• measurements from industrial sensors and wearable devices
Streaming data is a better fit for the way we live.
4. Challenges in Processing Streams:
• Event time (rather than data processing time); out-of-order events
• Consistency, fault tolerance, and high availability
• Rich forms of window queries; real-time alerts
• Low latency and high throughput
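To make the event-time point concrete, here is a minimal plain-Java sketch (illustrative only, not the Flink API; all names are invented for this example) showing that assigning events to 10-second tumbling windows by their event timestamp gives the same window counts no matter how out of order the events arrive:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EventTimeWindows {
    // Assign each event to the tumbling window its *event time* falls in,
    // regardless of the order in which events arrived.
    public static Map<Long, Integer> countPerWindow(List<Long> eventTimesMillis, long windowMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMillis) {
            long windowStart = (t / windowMillis) * windowMillis; // window the event belongs to
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Events arrive out of order: event times 12s, 3s, 8s, 15s
        List<Long> arrivals = List.of(12_000L, 3_000L, 8_000L, 15_000L);
        // Windows [0s,10s) and [10s,20s) each end up with 2 events
        System.out.println(countPerWindow(arrivals, 10_000L)); // {0=2, 10000=2}
    }
}
```

A processing-time window, by contrast, would count whatever happened to arrive during each wall-clock interval, so the late-arriving 3s and 8s events could land in the wrong bucket.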
7. Reference architecture (diagram). The slide shows a big data pipeline in four stages – COLLECT, STORE, PROCESS/ANALYZE, CONSUME – with AWS options at each stage:
• Collect / transport: devices, sensors & IoT platforms (AWS IoT), mobile apps, web apps, data centers (AWS Direct Connect), and logging (Amazon CloudWatch, AWS CloudTrail)
• Store (stream / search / NoSQL / cache, hot to warm): Apache Kafka, Amazon Kinesis Streams, Amazon DynamoDB and DynamoDB Streams, Amazon ElastiCache, Amazon Elasticsearch Service
• Process / analyze (batch, interactive, stream, ML): Amazon Kinesis Analytics, KCL apps, AWS Lambda, Amazon EC2, Amazon EMR
• Consume: applications, analysis & visualization, notebooks, IDE, API
A second view of the same diagram highlights the fast, stream-processing path through these stages.
9. “Apache Flink is an open source platform for distributed stream and batch data processing.”
Flink has a stream-first architecture that also does batch processing, as a special case of bounded-stream processing.
16. When to use Flink vs Spark Streaming?
Flink might be best when:
• your workload demands true real-time stream-processing performance with low latency, high throughput, and fault tolerance
• you are not yet heavily invested in Spark Streaming (existing systems, staff training/experience)
• you want the convenience of replaying and reprocessing streams after code/system changes
Spark Streaming might be best when:
• you primarily do batch processing
• you are already invested in Spark Streaming (existing deployments, staff)
• micro-batching is acceptable for your workload
• you need to code in Python or R
• you want to “wait and see” how Flink matures before adopting
21. Real-time streaming
High throughput; elastic
Keeps a ‘replayable log’ of your events
Easy to use
S3, Redshift, DynamoDB Integrations
Amazon Kinesis
22. Dynamically scalable transient or persistent Hadoop clusters as a service
Hadoop, Hive, Spark, Presto, HBase, … (17 applications)
Easy to use; fully managed
On-demand, reserved, and spot pricing
HDFS, S3, and Amazon EBS filesystems
End-to-end security: access controls, firewalls, encryption
Amazon EMR
23. Provisions & maintains an Elasticsearch cluster (distributed index)
Complete ELK stack, including Kibana
Fully managed service; zero admin
Highly available & reliable
Scalable
Amazon Elasticsearch Service
25. 1. For predominantly stream-processing workloads, but also for batch processing, Apache Flink has much to offer:
• Simultaneously addresses high throughput, low latency, and fault tolerance
• Do both stream processing & batch processing with a single technology
• Powerful windowing functions
• Convenient capabilities to pause, restart, and change Flink applications without data loss
2. Flink is still new, and adoption is not as far advanced as Spark Streaming’s.
3. AWS makes it easy to run streaming workloads with Amazon Kinesis and either Spark Streaming or Flink running on EMR clusters.
28. Approximate Event Time – Kinesis details
• Each Amazon Kinesis record includes an ApproximateArrivalTimestamp
• The timestamp is set when an Amazon Kinesis stream successfully receives and stores a record
• By default, Flink uses this timestamp as the event time when reading from a Kinesis stream
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
29. Event Time and Watermarks
• With event time, the time of an event is determined by the producer
• Flink measures progress in event time by means of watermarks
• Watermarks must be generated for each individual Kinesis shard
DataStream<Event> kinesis = env
    .addSource(new FlinkKinesisConsumer<>(...))
    .assignTimestampsAndWatermarks(new PunctuatedAssigner());
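The watermark idea itself can be sketched in plain Java, independent of the Flink API (this is an illustrative model, not Flink code): the watermark trails the maximum event time seen so far by an allowed out-of-orderness, and a window may fire once the watermark passes its end:

```java
// Minimal sketch of bounded-out-of-orderness watermarking (illustrative, not the Flink API)
public class WatermarkSketch {
    private final long maxOutOfOrdernessMillis;
    private long maxEventTimeSeen = Long.MIN_VALUE;

    public WatermarkSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    // Called for every event; returns the current watermark:
    // "no event older than this is expected anymore"
    public long onEvent(long eventTimeMillis) {
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
        return maxEventTimeSeen - maxOutOfOrdernessMillis;
    }

    // A window [start, end) may fire once the watermark passes its end
    public boolean canFire(long windowEndMillis) {
        return maxEventTimeSeen - maxOutOfOrdernessMillis >= windowEndMillis;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(2_000L); // tolerate 2s of disorder
        wm.onEvent(9_000L);                       // watermark is now 7s
        System.out.println(wm.canFire(10_000L)); // false: window [0,10s) must wait
        wm.onEvent(13_000L);                      // watermark is now 11s
        System.out.println(wm.canFire(10_000L)); // true: [0,10s) can fire
    }
}
```

This is why per-shard watermark generation matters: each shard's watermark advances only as that shard's events are read.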
30. Data Encryption with Amazon EMR and Flink
Security configuration supports encryption
• for data stored within the file system
• Hadoop Distributed File System (HDFS) block-transfer and
RPC
• S3 data (SSE-S3, SSE-KMS, CSE-KMS, CSE-Custom)
• Local disk (except boot volumes)
• In-transit data (no Flink support yet)
env.readTextFile("s3://...")
env.setStateBackend(new FsStateBackend("hdfs://..."))
31. Connecting to the Flink Dashboard
• Use dynamic port forwarding to the Master node
ssh -D 8157 hadoop@...
• Use FoxyProxy to redirect URLs to localhost
*ec2*.amazonaws.com*
*.compute.internal*
• Navigate to the YARN Resource Manager and select the
Tracking UI
32. Starting Flink and Submitting Jobs
Use steps to interact with Flink through the AWS API
33. Extending Flink Functionality
• The Flink Elasticsearch sink only supports TCP transport
• A custom Elasticsearch sink with HTTP support requires only a few dozen lines of code using:
• Jest (io.searchbox)
• aws-signing-request-interceptor (vc.inreach.aws)
Things Flink does differently:
Throughput, latency, and semantics are all important. Historically you had to choose two: Storm gives low latency and high throughput, but only at-least-once semantics; Spark Streaming gives high throughput and exactly-once semantics, but micro-batching leads to high latency.
Flink is the first open source project that supports all three: high throughput, low latency (since each event is processed individually), and exactly-once semantics.
Processing time vs event time: this becomes important when you encounter out-of-order events. Network issues and varying latencies can change the ordering of events, which makes semantics hard, so support for event time is crucial.
Semantics: you want exactly-once semantics.
Rich forms of window queries: Spark Streaming supports windows, even when the window is larger than the batch interval (typically the micro-batch is 0.5s while the window might be 1h); it keeps enough micro-batches around to satisfy the window size. But this falls on its face when the duration of the window is not known in advance: session queries, for example, are impossible, whereas Flink supports them natively.
For this use case, session queries would be interesting: how long is the shift of one taxi driver? (More interesting for the analytics layer.)
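The taxi-shift question is exactly a session-window query: group one driver's events into sessions separated by a gap of inactivity. A minimal plain-Java sketch of that grouping (illustrative only, not the Flink API; the gap threshold is an assumed parameter):

```java
import java.util.ArrayList;
import java.util.List;

public class SessionWindows {
    // Split time-ordered event timestamps into sessions: a gap longer than
    // gapMillis between consecutive events closes the current session.
    public static List<List<Long>> sessionize(List<Long> sortedEventTimes, long gapMillis) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sortedEventTimes) {
            if (!current.isEmpty() && t - current.get(current.size() - 1) > gapMillis) {
                sessions.add(current);           // gap exceeded: the shift ended
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }

    public static void main(String[] args) {
        // One driver's trip events (in minutes): a long break after minute 30
        List<Long> events = List.of(0L, 10L, 30L, 500L, 510L);
        System.out.println(sessionize(events, 60L).size()); // 2 shifts
    }
}
```

Note that the window boundaries here are determined by the data itself, which is why a fixed stock of micro-batches cannot express this query.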
So what exactly do we mean by a big data streaming pipeline architecture?
First, what is a traditional big data pipeline architecture?
The abstract phases in a big data pipeline: collection, storage, iterative processing and/or analysis, then consumption of the stored data.
On AWS we have many concrete options to choose from for implementing each of the big data pipeline phases; we won’t go into each of these, just illustrate the breadth of options.
With higher-temperature (hotter) data, we move into the realm of stream-processing pipelines.
Several options are available for dealing with hot / fast data.
Under the processing / analysis part, our stream-processing options include: Storm,
Main points:
- Streaming first!
- Batch as a special case of streaming.
This is the other way around from Apache Spark, for instance, which focuses on batch and treats streams as a continuous series of micro-batches.
* The DataStream API provides data structures that represent distributed data streams.
YARN is the underlying cluster manager; if you want a managed YARN cluster, we have you covered!
APIs are available in Scala and Java, and some new projects are working to provide Python access.
Can run in a distributed manner over hundreds or thousands of machines
Flink framework automatically takes care of correctly restoring the computation in the event of machine and other failures, or intentional reprocessing, as in the case of bug fixes or version upgrades.
Exactly Once delivery guarantees
Makes it easy for Flink to perform well in production
Rich windowing functions that work with event time.
Watermarks
SavePoints
Historically, stream processing has meant settling for two out of three concerns: high throughput, low latency, and reliability.
With Flink you can get all three.
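One common way the exactly-once part is realized at the sink in practice is idempotent writes that deduplicate on a per-record sequence number, so replays after a failure do not double-count. A minimal plain-Java sketch (illustrative only; the sequence numbers and sink are hypothetical, not a Flink or Kinesis API):

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentSink {
    private final Set<Long> seenSequenceNumbers = new HashSet<>();
    private long total = 0;

    // Apply each record at most once, even if an at-least-once upstream
    // delivers it again during a replay after failure.
    public void write(long sequenceNumber, long value) {
        if (seenSequenceNumbers.add(sequenceNumber)) {
            total += value;
        }
    }

    public long total() { return total; }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        sink.write(1, 10);
        sink.write(2, 5);
        sink.write(1, 10); // replayed after a failure: ignored
        System.out.println(sink.total()); // 15, not 25
    }
}
```

Flink itself achieves exactly-once state updates via checkpointing; the dedup-at-the-sink pattern above is how end-to-end exactly-once output is often completed.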
Spark Streaming:
* Everything is treated as a batch, including streams.
* A stream of data from continuous events is broken into a series of small atomic batch jobs (“micro-batches”).
* If the batches are small enough, this can approximate true streaming, but latency won’t be truly real-time.
* Can lead to fragile pipelines that mix DevOps with app-development concerns.
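To see why micro-batching puts a floor under latency, consider this plain-Java sketch (illustrative numbers; not Spark code): an event cannot even begin processing until its micro-batch closes, so it waits up to the full batch interval first:

```java
public class MicroBatchLatency {
    // Milliseconds an event waits before its micro-batch closes
    // and processing can even start.
    public static long waitUntilBatchCloses(long eventArrivalMillis, long batchIntervalMillis) {
        long batchEnd = ((eventArrivalMillis / batchIntervalMillis) + 1) * batchIntervalMillis;
        return batchEnd - eventArrivalMillis;
    }

    public static void main(String[] args) {
        long interval = 500L; // a typical 0.5s micro-batch interval
        // Arriving just after a batch opens: waits almost the full interval
        System.out.println(waitUntilBatchCloses(1_001L, interval)); // 499
        // Arriving just before the batch closes: waits almost nothing
        System.out.println(waitUntilBatchCloses(1_499L, interval)); // 1
    }
}
```

Actual processing time then adds on top of this wait, which is why a per-event engine like Flink can reach lower latencies than any micro-batch interval allows.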
Flink:
* Everything is treated as a stream, including batches.
* Each event is processed individually as it arrives, rather than being grouped into micro-batches.
* Batch processing is simply the special case of a bounded stream.
* This gives true real-time latency along with high throughput and exactly-once semantics.
Real-world data set, no synthetic data
Publicly available
Shows how to approach a real-world problem!
A log of all events that supports ‘replays’, i.e. random access to our event log (for window queries).
Processing: we want high throughput, exactly-once semantics, and low latency; turn data into (perishable) insights.
Visualization: to make these perishable insights consumable by human users.
http://aws.amazon.com/kinesis
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time, from sources such as web site click-streams, marketing and financial information, manufacturing instrumentation and social media, and operational logs and metering data.
Dynamic port forwarding turns your SSH client into a SOCKS proxy server
Elasticsearch drops TCP transport in v5; Flink will have to adapt to this anyway.
Elastic is investing in their own HTTP client; we currently use Jest.
Spark encrypted shuffles are currently in the works.
But for MapReduce, there is a possibility of enabling end-to-end encrypted data flow and storage.
There is also the ability to enable server-side encryption at the S3 bucket level, which is also supported by EMRFS.