Organizations demand ever-faster tools to process and analyze data in real time. Apache Spark and Apache Flink have emerged as popular, open source frameworks to address these requirements. In this tech talk, we provide an overview of these technologies and the differences between them. We show how you can deploy Apache Spark and Flink on AWS to address common big data use cases such as batch and real-time data processing, interactive data science, and predictive analytics. We talk about common architectures for running these frameworks on Amazon EMR, including tips to connect to Amazon Kinesis – a platform of managed services that makes it easy to work with real-time streaming data in the AWS Cloud – and Apache Kafka, a popular open source platform for streaming data.
Learning Objectives:
• Understand common use cases and differences between Apache Spark and Apache Flink
• Explain deployment modes and best practices for running Spark and Flink on Amazon EMR
• Identify ways to connect to Kinesis and Kafka for streaming ingest
• Describe how to architect streaming jobs for durability and availability
Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks
1. Real-time Stream Processing on EMR:
Apache Flink vs Apache Spark Streaming
Keith Steward, Ph.D.
Specialist (EMR) Solution Architect
AWS
2. What we’ll cover:
1. The need for real-time stream processing, and challenges in accomplishing it
2. Flink stream processor (versus Spark Streaming):
• What are its aims?
• How does it address real-time stream processing challenges?
• How does it differ from Spark Streaming?
• Real-world Flink examples
• When to use Flink vs Spark Streaming?
3. Flink Demo: How to deploy & run a Flink stream processing architecture in AWS
3. The Need for Real-Time Stream Processing
Increasingly, data is arriving as continuous flows of events:
• cars in motion emitting GPS signals
• financial transactions
• interchange of signals between cell phone towers and people busy with their smartphones
• web traffic
• machine logs
• measurements from industrial sensors and wearable devices
Streaming data is a better fit for the way we live.
4. Challenges in Processing Streams:
• Event time (rather than data processing time); out-of-order events
• Consistency, fault tolerance, and high availability
• Rich forms of window queries; real-time alerts
• Low latency and high throughput
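To make the event-time point concrete, here is a minimal plain-Java sketch (illustrative only, not the Flink API; all names are invented for this example) showing that assigning events to 10-second tumbling windows by their event timestamp gives the same window counts no matter how out of order the events arrive:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EventTimeWindows {
    // Assign each event to the tumbling window its *event time* falls in,
    // regardless of the order in which events arrived.
    public static Map<Long, Integer> countPerWindow(List<Long> eventTimesMillis, long windowMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMillis) {
            long windowStart = (t / windowMillis) * windowMillis; // window the event belongs to
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Events arrive out of order: event times 12s, 3s, 8s, 15s
        List<Long> arrivals = List.of(12_000L, 3_000L, 8_000L, 15_000L);
        // Windows [0s,10s) and [10s,20s) each end up with 2 events
        System.out.println(countPerWindow(arrivals, 10_000L)); // {0=2, 10000=2}
    }
}
```

A processing-time window, by contrast, would count whatever happened to arrive during each wall-clock interval, so the late-arriving 3s and 8s events could land in the wrong bucket.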
7. Reference architecture (diagram). The slide shows a big data pipeline in four stages – COLLECT, STORE, PROCESS/ANALYZE, CONSUME – with AWS options at each stage:
• Collect / transport: devices, sensors & IoT platforms (AWS IoT), mobile apps, web apps, data centers (AWS Direct Connect), and logging (Amazon CloudWatch, AWS CloudTrail)
• Store (stream / search / NoSQL / cache, hot to warm): Apache Kafka, Amazon Kinesis Streams, Amazon DynamoDB and DynamoDB Streams, Amazon ElastiCache, Amazon Elasticsearch Service
• Process / analyze (batch, interactive, stream, ML): Amazon Kinesis Analytics, KCL apps, AWS Lambda, Amazon EC2, Amazon EMR
• Consume: applications, analysis & visualization, notebooks, IDE, API
A second view of the same diagram highlights the fast, stream-processing path through these stages.
9. “Apache Flink is an open source platform for distributed stream and batch data processing.”
Flink has a stream-first architecture that also does batch processing, as a special case of bounded-stream processing.
16. When to use Flink vs Spark Streaming?
Flink might be best when:
• your workload demands true real-time stream-processing performance with low latency, high throughput, and fault tolerance
• you are not yet heavily invested in Spark Streaming (existing systems, staff training/experience)
• you want the convenience of replaying and reprocessing streams after code/system changes
Spark Streaming might be best when:
• you primarily do batch processing
• you are already invested in Spark Streaming (existing deployments, staff)
• micro-batching is acceptable for your workload
• you need to code in Python or R
• you want to “wait and see” how Flink matures before adopting
21. Real-time streaming
High throughput; elastic
Keeps a ‘replayable log’ of your events
Easy to use
S3, Redshift, DynamoDB Integrations
Amazon Kinesis
22. Dynamically scalable transient or persistent Hadoop clusters as a service
Hadoop, Hive, Spark, Presto, HBase, … (17 applications)
Easy to use; fully managed
On-demand, reserved, and spot pricing
HDFS, S3, and Amazon EBS filesystems
End-to-end security: access controls, firewalls, encryption
Amazon EMR
23. Provisions & maintains an Elasticsearch cluster (distributed index)
Complete ELK stack, including Kibana
Fully managed service; zero admin
Highly available & reliable
Scalable
Amazon Elasticsearch Service
25. 1. For predominantly stream-processing workloads, but also for batch processing, Apache Flink has much to offer:
• Simultaneously addresses high throughput, low latency, and fault tolerance
• Do both stream processing & batch processing with a single technology
• Powerful windowing functions
• Convenient capabilities to pause, restart, and change Flink applications without data loss
2. Flink is still new, and adoption is not as far advanced as Spark Streaming’s.
3. AWS makes it easy to run streaming workloads with Amazon Kinesis and either Spark Streaming or Flink running on EMR clusters.
28. Approximate Event Time – Kinesis details
• Each Amazon Kinesis record includes an ApproximateArrivalTimestamp
• The timestamp is set when an Amazon Kinesis stream successfully receives and stores a record
• By default, Flink uses this timestamp as the event time when reading from a Kinesis stream
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
29. Event Time and Watermarks
• With event time, the time of an event is determined by the producer
• Flink measures progress in event time by means of watermarks
• Watermarks must be generated for each individual Kinesis shard
DataStream<Event> kinesis = env
    .addSource(new FlinkKinesisConsumer<>(...))
    .assignTimestampsAndWatermarks(new PunctuatedAssigner());
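The watermark idea itself can be sketched in plain Java, independent of the Flink API (this is an illustrative model, not Flink code): the watermark trails the maximum event time seen so far by an allowed out-of-orderness, and a window may fire once the watermark passes its end:

```java
// Minimal sketch of bounded-out-of-orderness watermarking (illustrative, not the Flink API)
public class WatermarkSketch {
    private final long maxOutOfOrdernessMillis;
    private long maxEventTimeSeen = Long.MIN_VALUE;

    public WatermarkSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    // Called for every event; returns the current watermark:
    // "no event older than this is expected anymore"
    public long onEvent(long eventTimeMillis) {
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
        return maxEventTimeSeen - maxOutOfOrdernessMillis;
    }

    // A window [start, end) may fire once the watermark passes its end
    public boolean canFire(long windowEndMillis) {
        return maxEventTimeSeen - maxOutOfOrdernessMillis >= windowEndMillis;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(2_000L); // tolerate 2s of disorder
        wm.onEvent(9_000L);                       // watermark is now 7s
        System.out.println(wm.canFire(10_000L)); // false: window [0,10s) must wait
        wm.onEvent(13_000L);                      // watermark is now 11s
        System.out.println(wm.canFire(10_000L)); // true: [0,10s) can fire
    }
}
```

This is why per-shard watermark generation matters: each shard's watermark advances only as that shard's events are read.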
30. Data Encryption with Amazon EMR and Flink
Security configuration supports encryption
• for data stored within the file system
• Hadoop Distributed File System (HDFS) block-transfer and
RPC
• S3 data (SSE-S3, SSE-KMS, CSE-KMS, CSE-Custom)
• Local disk (except boot volumes)
• In-transit data (no Flink support yet)
env.readTextFile("s3://...")
env.setStateBackend(new FsStateBackend("hdfs://..."))
31. Connecting to the Flink Dashboard
• Use dynamic port forwarding to the Master node
ssh -D 8157 hadoop@...
• Use FoxyProxy to redirect URLs to localhost
*ec2*.amazonaws.com*
*.compute.internal*
• Navigate to the YARN Resource Manager and select the
Tracking UI
32. Starting Flink and Submitting Jobs
Use steps to interact with Flink through the AWS API
33. Extending Flink Functionality
• The Flink Elasticsearch sink only supports TCP transport
• A custom Elasticsearch sink with HTTP support requires only a few dozen lines of code using:
• Jest (io.searchbox)
• aws-signing-request-interceptor (vc.inreach.aws)
Things Flink does differently:
Throughput, latency, and semantics are all important. Historically you had to choose two: Storm gives low latency and high throughput, but only at-least-once semantics; Spark Streaming gives high throughput and exactly-once semantics, but micro-batching leads to high latency.
Flink is the first open source project that supports all three: high throughput, low latency (since each event is processed individually), and exactly-once semantics.
Processing time vs event time: this becomes important when you encounter out-of-order events. Network issues and varying latencies can change the ordering of events, which makes semantics hard, so support for event time is crucial.
Semantics: you want exactly-once semantics.
Rich forms of window queries: Spark Streaming supports windows, even when the window is larger than the batch interval (typically the micro-batch is 0.5s while the window might be 1h); it keeps enough micro-batches around to satisfy the window size. But this falls on its face when the duration of the window is not known in advance: session queries, for example, are impossible, whereas Flink supports them natively.
For this use case, session queries would be interesting: how long is the shift of one taxi driver? (More interesting for the analytics layer.)
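The taxi-shift question is exactly a session-window query: group one driver's events into sessions separated by a gap of inactivity. A minimal plain-Java sketch of that grouping (illustrative only, not the Flink API; the gap threshold is an assumed parameter):

```java
import java.util.ArrayList;
import java.util.List;

public class SessionWindows {
    // Split time-ordered event timestamps into sessions: a gap longer than
    // gapMillis between consecutive events closes the current session.
    public static List<List<Long>> sessionize(List<Long> sortedEventTimes, long gapMillis) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long t : sortedEventTimes) {
            if (!current.isEmpty() && t - current.get(current.size() - 1) > gapMillis) {
                sessions.add(current);           // gap exceeded: the shift ended
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }

    public static void main(String[] args) {
        // One driver's trip events (in minutes): a long break after minute 30
        List<Long> events = List.of(0L, 10L, 30L, 500L, 510L);
        System.out.println(sessionize(events, 60L).size()); // 2 shifts
    }
}
```

Note that the window boundaries here are determined by the data itself, which is why a fixed stock of micro-batches cannot express this query.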
So what exactly do we mean by a big data streaming pipeline architecture?
First, what is a traditional big data pipeline architecture?
The abstract phases in a big data pipeline: collection, storage, iterative processing and/or analysis, then consumption of the stored data.
On AWS we have many concrete options to choose from for implementing each of the big data pipeline phases; we won’t go into each of these, just illustrate the breadth of options.
With higher-temperature (hotter) data, we move into the realm of stream-processing pipelines.
Several options are available for dealing with hot / fast data.
Under the processing / analysis part, our stream-processing options include: Storm,
Main points:
- Streaming first!
- Batch as a special case of streaming.
This is the other way around from Apache Spark, for instance, which focuses on batch and treats streams as a continuous series of micro-batches.
* The DataStream API provides data structures that represent distributed data streams.
YARN is the underlying cluster manager; if you want a managed YARN cluster, we have you covered!
APIs are available in Scala and Java, and some new projects are working to provide Python access.
Can run in a distributed manner over hundreds or thousands of machines
Flink framework automatically takes care of correctly restoring the computation in the event of machine and other failures, or intentional reprocessing, as in the case of bug fixes or version upgrades.
Exactly Once delivery guarantees
Makes it easy for Flink to perform well in production
Rich windowing functions that work with event time.
Watermarks
SavePoints
Historically, stream processing has meant settling for two out of three concerns: high throughput, low latency, and reliability.
With Flink you can get all three.
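One common way the exactly-once part is realized at the sink in practice is idempotent writes that deduplicate on a per-record sequence number, so replays after a failure do not double-count. A minimal plain-Java sketch (illustrative only; the sequence numbers and sink are hypothetical, not a Flink or Kinesis API):

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentSink {
    private final Set<Long> seenSequenceNumbers = new HashSet<>();
    private long total = 0;

    // Apply each record at most once, even if an at-least-once upstream
    // delivers it again during a replay after failure.
    public void write(long sequenceNumber, long value) {
        if (seenSequenceNumbers.add(sequenceNumber)) {
            total += value;
        }
    }

    public long total() { return total; }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        sink.write(1, 10);
        sink.write(2, 5);
        sink.write(1, 10); // replayed after a failure: ignored
        System.out.println(sink.total()); // 15, not 25
    }
}
```

Flink itself achieves exactly-once state updates via checkpointing; the dedup-at-the-sink pattern above is how end-to-end exactly-once output is often completed.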
Spark Streaming:
* Everything is treated as a batch, including streams.
* A stream of data from continuous events is broken into a series of small atomic batch jobs (“micro-batches”).
* If the batches are small enough, this can approximate true streaming, but latency won’t be truly real-time.
* Can lead to fragile pipelines that mix DevOps with app-development concerns.
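To see why micro-batching puts a floor under latency, consider this plain-Java sketch (illustrative numbers; not Spark code): an event cannot even begin processing until its micro-batch closes, so it waits up to the full batch interval first:

```java
public class MicroBatchLatency {
    // Milliseconds an event waits before its micro-batch closes
    // and processing can even start.
    public static long waitUntilBatchCloses(long eventArrivalMillis, long batchIntervalMillis) {
        long batchEnd = ((eventArrivalMillis / batchIntervalMillis) + 1) * batchIntervalMillis;
        return batchEnd - eventArrivalMillis;
    }

    public static void main(String[] args) {
        long interval = 500L; // a typical 0.5s micro-batch interval
        // Arriving just after a batch opens: waits almost the full interval
        System.out.println(waitUntilBatchCloses(1_001L, interval)); // 499
        // Arriving just before the batch closes: waits almost nothing
        System.out.println(waitUntilBatchCloses(1_499L, interval)); // 1
    }
}
```

Actual processing time then adds on top of this wait, which is why a per-event engine like Flink can reach lower latencies than any micro-batch interval allows.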
Flink:
* Everything is treated as a stream, including batches.
* Each event is processed individually as it arrives, rather than being grouped into micro-batches.
* Batch processing is simply the special case of a bounded stream.
* This gives true real-time latency along with high throughput and exactly-once semantics.
Real-world data set, no synthetic data
Publicly available
Shows how to approach a real-world problem!
A log of all events that supports ‘replays’, i.e. random access to our event log (for window queries).
Processing: we want high throughput, exactly-once semantics, and low latency; turn data into (perishable) insights.
Visualization: to make these perishable insights consumable by human users.
http://aws.amazon.com/kinesis
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time, from sources such as web site click-streams, marketing and financial information, manufacturing instrumentation and social media, and operational logs and metering data.
Dynamic port forwarding turns your SSH client into a SOCKS proxy server
Elasticsearch drops TCP transport in v5; Flink will have to adapt to this anyway.
Elastic is investing in their own HTTP client; we currently use Jest.
Spark encrypted shuffles are currently in the works.
But for MapReduce, there is a possibility of enabling end-to-end encrypted data flow and storage.
There is also the ability to enable server-side encryption at the S3 bucket level, which is also supported by EMRFS.