Real-time Stream Processing on EMR:
Apache Flink vs Apache Spark Streaming
Keith Steward, Ph.D.
Specialist (EMR) Solution Architect
AWS
What we’ll cover:
1. The need for real-time stream processing, and challenges in
accomplishing it
2. Flink stream processor (versus Spark Streaming):
• What are its aims?
• How does it address real-time stream processing challenges?
• How does it differ from Spark Streaming?
• Real-world Flink examples
• When to use Flink vs Spark Streaming?
3. Flink Demo: How to deploy & run a Flink stream processing
architecture in AWS?
The Need for Real-Time Stream Processing
Increasingly, data is arriving as continuous flows
of events:
• cars in motion emitting GPS signals
• financial transactions
• interchange of signals between cell phone towers
and people busy with their smartphones
• web traffic
• machine logs
• measurements from industrial sensors and wearable
devices
Streaming data is a better fit for the way we
live.
Challenges in Processing Streams:
• Event time (rather than processing
time); out-of-order events
• Consistency, fault tolerance, and high
availability
• Rich forms of window queries; real-time
alerts
• Low latency and high throughput
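To make the event-time challenge concrete, here is a minimal sketch in plain Java (no Flink dependency) that assigns events to tumbling windows by the timestamp each event carries. The class name, method name, and sample timestamps are illustrative assumptions, not part of the talk. Because the window is derived from the event's own timestamp, the counts come out the same whether events arrive in order or shuffled.

```java
import java.util.Map;
import java.util.TreeMap;

final class EventTimeWindows {
    /**
     * Counts events per tumbling window, keyed by window start (ms).
     * Assumes non-negative timestamps and a positive window size.
     */
    static Map<Long, Integer> countPerWindow(long[] eventTimesMs, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMs) {
            long windowStart = t - (t % windowSizeMs); // align to the window boundary
            counts.merge(windowStart, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] inOrder  = {1_000, 4_000, 9_000, 11_000, 12_000};
        long[] shuffled = {11_000, 1_000, 12_000, 9_000, 4_000}; // same events, different arrival order
        System.out.println(countPerWindow(inOrder, 10_000));
        System.out.println(countPerWindow(shuffled, 10_000));
    }
}
```

Grouping by arrival (processing) time instead would make the result depend on network delays, which is why event-time support matters for out-of-order streams.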
Traditional Big Data Pipeline (without Stream Processing)
Reference architecture (diagram): COLLECT → STORE → PROCESS / ANALYZE (with ETL) → CONSUME
Reference architecture (diagram): AWS options for each pipeline phase
• COLLECT (Applications, Logging, IoT, Transport, Messaging): mobile apps, web apps, devices, sensors & IoT platforms, data centers, Amazon CloudWatch, AWS CloudTrail, AWS IoT, AWS Direct Connect, AWS Import/Export Snowball
• Data types: records, documents, files, messages, streams
• STORE (File, Message, Stream, Search, SQL, NoSQL, Cache; labels: Hot, Hot, Warm): Amazon S3, Amazon SQS, Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams, Amazon Elasticsearch Service, Amazon RDS, Amazon DynamoDB, Amazon ElastiCache
• PROCESS / ANALYZE, with ETL (Batch, Message, Interactive, Stream, ML; labels: Fast, Slow, Fast): Amazon EMR, Amazon EC2, Amazon SQS apps, Amazon Redshift, Presto, Streaming, Amazon Kinesis Analytics, KCL apps, AWS Lambda, Amazon Machine Learning
• CONSUME: Amazon QuickSight, apps & services, analysis & visualization, notebooks, IDE, API
Reference architecture (diagram): Stream Processing
• COLLECT (Applications, Logging, IoT, Transport): mobile apps, web apps, devices, message, sensors & IoT platforms, data centers, Amazon CloudWatch, AWS CloudTrail, AWS IoT, AWS Direct Connect
• Data types: records, streams
• STORE (Stream, Search, NoSQL, Cache; labels: Hot, Hot, Warm): Apache Kafka, Amazon Kinesis Streams, Amazon DynamoDB Streams, Amazon Elasticsearch Service, Amazon DynamoDB, Amazon ElastiCache
• PROCESS / ANALYZE, with ETL (Batch, Interactive, Stream, ML; label: Fast): Streaming, Amazon EC2, Amazon EMR, Amazon Kinesis Analytics, KCL apps, AWS Lambda
• CONSUME: apps & services, analysis & visualization, notebooks, IDE, API
“Apache Flink is an open source platform for
distributed stream and batch data processing.”
Flink has a stream-first architecture that also
handles batch processing, as a special case of
bounded stream processing.
Apache Flink
Flink Handles Challenging Scenarios:
1. Application code upgrades
2. Flink version upgrades
3. Maintenance and migration
4. What-if simulations (reinstatements)
5. A/B testing
Flink addresses all Streaming Challenges Simultaneously
Flink Performance Tests
There are connectors for third-party data sources:
• Amazon Kinesis Streams
• Apache Kafka
• Elasticsearch
• Twitter Streaming API
• Cassandra
Where is Flink being used in production?
When to use Flink vs Spark Streaming?
Flink might be best when:
• workload demands true real-time
stream processing performance with
low latency, high throughput, and
fault-tolerance.
• not yet heavily invested in Spark
Streaming (existing systems, staff
training/experience)
• want convenience of replaying and
reprocessing streams after
code/system changes
Spark Streaming might be
best when:
• Primarily do batch processing
• Already invested in Spark Streaming
(existing deployments, staff)
• The micro-batching is acceptable for
your workload
• Need to code in Python or R
• Want to “wait and see” how Flink
matures before adopting.
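Whether micro-batching is acceptable depends on how much waiting time a workload can tolerate. The sketch below is a deliberately simplified model in plain Java, not Spark Streaming code: it computes only how long an event waits for its micro-batch to close and ignores scheduling and processing time. The class and method names are hypothetical, and the 500 ms interval is an example value.

```java
final class MicroBatchLatency {
    /**
     * Time an event waits until the micro-batch it falls into closes.
     * Batches cover [k * interval, (k + 1) * interval); assumes arrivalMs >= 0
     * and batchIntervalMs > 0.
     */
    static long waitUntilBatchCloses(long arrivalMs, long batchIntervalMs) {
        long batchEnd = (arrivalMs / batchIntervalMs + 1) * batchIntervalMs;
        return batchEnd - arrivalMs;
    }

    public static void main(String[] args) {
        long intervalMs = 500; // example batch interval
        for (long arrival : new long[]{0, 100, 499}) {
            System.out.println("arrival=" + arrival + " ms, waits "
                    + waitUntilBatchCloses(arrival, intervalMs) + " ms");
        }
    }
}
```

In this model an event waits up to one full batch interval before processing can even start, whereas a per-event processor has no such batching delay.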
How to deploy & run Flink in AWS?
Flink Demo:
Analyzing NYC Taxi Rides in Real Time
Demo Event Processing Architecture
“Replayable Log” → Stream Processing → Visualization
Demo Event Processing Architecture
Amazon Kinesis → Amazon EMR + EC2 instance (bastion host) → Amazon Elasticsearch Service
Amazon Kinesis
• Real-time streaming
• High throughput; elastic
• Keeps a ‘replayable log’ of your events
• Easy to use
• S3, Redshift, DynamoDB integrations
Amazon EMR
• Dynamically scalable transient or persistent Hadoop clusters as a service
• Hadoop, Hive, Spark, Presto, HBase, … (17 applications)
• Easy to use; fully managed
• On-demand, reserved, and spot pricing
• HDFS, S3, and Amazon EBS filesystems
• End-to-end security: access controls, firewalls, encryption
Amazon Elasticsearch Service
• Provisions & maintains an Elasticsearch cluster (distributed index)
• Complete ELK stack, including Kibana
• Fully managed service; zero admin
• Highly available & reliable
• Scalable
{show demo!}
1. For predominantly stream-processing workloads but
also for batch processing, Apache Flink has much to
offer:
• Simultaneously addresses high throughput, low latency,
and fault-tolerance
• Do both stream processing & batch processing with a
single technology
• Powerful windowing functions
• Convenient capabilities to pause, restart, and change
Flink applications without data loss.
2. Flink is still new and adoption is not as far advanced as
Spark Streaming.
3. AWS makes it easy to run streaming workloads with
Amazon Kinesis and either Spark Streaming or Flink
running on EMR clusters.
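The "restart without data loss" point rests on combining a replayable log with periodic snapshots of state. The following is a toy sketch of that idea in plain Java. It is not Flink's checkpointing implementation; the class, the method, and the running-sum "job" are illustrative assumptions. After a simulated crash, the job restores the last snapshot (offset plus state) and re-reads the log from that offset, so each record contributes to the result exactly once.

```java
final class ReplaySketch {
    /** Immutable snapshot: position in the log plus the state computed up to that position. */
    static final class Checkpoint {
        final int offset;
        final long sum;
        Checkpoint(int offset, long sum) { this.offset = offset; this.sum = sum; }
    }

    /**
     * Sums the log, snapshotting every checkpointEvery records (must be > 0) and
     * simulating one crash just before reading failAtOffset (negative means no crash).
     */
    static long sumWithOneFailure(long[] log, int checkpointEvery, int failAtOffset) {
        Checkpoint last = new Checkpoint(0, 0L);
        int offset = 0;
        long sum = 0L;
        boolean alreadyFailed = false;
        while (offset < log.length) {
            if (!alreadyFailed && offset == failAtOffset) {
                // Crash: in-memory progress is lost; restore the snapshot and replay the log.
                alreadyFailed = true;
                offset = last.offset;
                sum = last.sum;
                continue;
            }
            sum += log[offset];
            offset++;
            if (offset % checkpointEvery == 0) {
                last = new Checkpoint(offset, sum);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] log = {1, 2, 3, 4, 5};
        System.out.println(sumWithOneFailure(log, 2, 3));  // crash before reading offset 3
        System.out.println(sumWithOneFailure(log, 2, -1)); // no crash
    }
}
```

Both runs produce the same total, which is the property a replayable source such as a Kinesis stream makes possible.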
Questions?
stewardk@amazon.com
Keith Steward, Ph.D.
Specialist (EMR) SA
AWS
Additional Details
Approximate Event Time – Kinesis details
• Each Amazon Kinesis record includes an
ApproximateArrivalTimestamp
• The timestamp is set when an Amazon Kinesis stream
successfully receives and stores a record
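• By default, Flink uses this timestamp as the event time
when reading from a Kinesis stream
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// Use event time (timestamps carried by the records) rather than processing time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);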
Event Time and Watermarks
• With event time the time of an event is determined by the
producer
• Flink measures progress in event time by means of
Watermarks
• Watermarks must be ingested into each individual Kinesis
shard
// PunctuatedAssigner is not defined on the slide; presumably a user-supplied assigner
DataStream<Event> kinesis = env
    .addSource(new FlinkKinesisConsumer<>(...))
    .assignTimestampsAndWatermarks(new PunctuatedAssigner());
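To illustrate how a watermark tracks progress in event time, here is a small plain-Java sketch. It uses a bounded-out-of-orderness heuristic (watermark = highest timestamp seen minus a fixed allowance), which is a different strategy from the punctuated watermarks in the snippet above and a simplification of what a stream processor does internally. The class name, method names, and the 2-second allowance are assumptions for illustration only.

```java
final class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Current watermark: events with a timestamp at or below this value are treated as late. */
    long currentWatermark() {
        return maxTimestampSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestampSeen - maxOutOfOrdernessMs;
    }

    /** Returns true if the event is on time, false if it arrived behind the watermark. */
    boolean accept(long eventTimestampMs) {
        boolean onTime = eventTimestampMs > currentWatermark();
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return onTime;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessSketch wm = new BoundedOutOfOrdernessSketch(2_000);
        long[] arrivals = {1_000, 3_000, 2_500, 6_000, 3_500}; // arrival order, not event order
        for (long t : arrivals) {
            boolean onTime = wm.accept(t);
            System.out.println(t + " onTime=" + onTime + " watermark=" + wm.currentWatermark());
        }
    }
}
```

In this run the event stamped 2,500 is out of order but still within the allowance, while the one stamped 3,500 arrives after the watermark has advanced to 4,000 and is flagged as late.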
Data Encryption with Amazon EMR and Flink
Security configuration supports encryption
• for data stored within the file system
• Hadoop Distributed File System (HDFS) block-transfer and
RPC
• S3 data (SSE-S3, SSE-KMS, CSE-KMS, CSE-Custom)
• Local disk (except boot volumes)
• In-transit data (no Flink support yet)
env.readTextFile("s3://...")
env.setStateBackend(new FsStateBackend("hdfs://..."))
Connecting to the Flink Dashboard
• Use dynamic port forwarding to the Master node
ssh -D 8157 hadoop@...
• Use FoxyProxy to redirect URLs to localhost
*ec2*.amazonaws.com*
*.compute.internal*
• Navigate to the YARN Resource Manager and select the
Tracking UI
Starting Flink and Submitting Jobs
Use steps to interact with Flink through the AWS API
Extending Flink Functionality
• The Flink Elasticsearch sink supports only TCP transport
• A custom Elasticsearch sink with HTTP support requires
only a few dozen lines of code, using:
• Jest (io.searchbox)
• aws-signing-request-interceptor (vc.inreach.aws)
Reference architecture (diagram): the full AWS big data pipeline with the service options for COLLECT, STORE, PROCESS / ANALYZE (with ETL), and CONSUME, repeated from the earlier overview slide.

Deep Dive of Flink & Spark on Amazon EMR - February Online Tech Talks


Editor's Notes

  1. Things Flink does differently. Throughput, latency, and semantics are all important, but typically you have to choose two. Storm: low latency and high throughput, but only at-least-once semantics. Spark Streaming: high throughput and exactly-once semantics, but micro-batching, which leads to high latency. Flink: the first open source project that supports all three: high throughput, low latency (since each event is processed separately), and exactly-once semantics. Processing time vs. event time: this becomes important when you encounter out-of-order events. Network issues and latencies can change the ordering of events, which makes semantics hard, so support for event time is crucial. Semantics: you want exactly-once semantics. Rich forms of window queries: Spark Streaming supports windows even when the window is larger than the batch interval (a typical micro-batch is 0.5 s, while a window might be 1 h); it keeps enough micro-batches around to satisfy the window size. But this falls on its face when the duration of the window is not known in advance, e.g. it is impossible to do session queries (Flink supports these natively). For this use case, such session queries would be interesting: how long is the shift of one taxi driver? (This is more interesting for the analytics layer.)
  2. So what exactly do we mean by a Big Data Streaming pipeline architecture? First, what is a traditional Big Data pipeline architecture? Abstract phases in a big data pipeline: collection, storage, iterative processing and/or analysis, then consumption of the stored data.
  3. On AWS we have many concrete options to choose from for implementing each of the big data pipeline phases; we won’t go into each of these; just to illustrate the breadth of options
  4. With higher-temperature data, we move into the realm of stream processing pipelines. Several options are available for dealing with hot / fast data. Under the processing / analysis part, our stream-processing options include: Storm,
  5. Main points: Streaming first! Batch is a special case of streaming. This is the other way around from Apache Spark, for instance, which focuses on batch and treats streams as a continuous series of micro-batches. Typically you have to choose 2. Storm: low latency and high throughput, but at-least-once. Spark Streaming: high throughput and exactly-once semantics, but micro-batching, leading to high latency. Flink: first open source project that supports all three: high throughput, low latency (since each event is processed separately), and exactly-once. Processing time vs event time: becomes important if you encounter out-of-order events. Network issues and latencies can change the ordering of events => makes semantics hard => support for event time is crucial! Semantics: you want to have exactly-once semantics. Rich forms of window queries: e.g. Spark Streaming supports windows, even when the window is larger than the batch interval (typically a micro-batch is 0.5s, a window might be 1h) => it will keep enough micro-batches around to satisfy the window size. But this falls on its face when the duration of the window is not known in advance => e.g. impossible to do session queries (but Flink supports this natively). For this use case, these session queries would be interesting: how long is a shift of one taxi driver? (More interesting for the analytics layer.)
  6. * DataStream API provides data structures that represent distri ut YARN as underlying cluster manager => if you want to have a managed YARN cluster, we have you covered! APIs are available in Scala and Java, and some new projects are working to provide Python access. Can run in a distributed manner over hundreds or thousands of machines.
  7. The Flink framework automatically takes care of correctly restoring the computation in the event of machine and other failures, or intentional reprocessing, as in the case of bug fixes or version upgrades. Exactly-once delivery guarantees. Makes it easy for Flink to perform well in production. Rich windowing functions that work with event time. Watermarks. Savepoints.
  8. Historically, stream processing has meant you had to settle for 2 out of 3 concerns: high throughput, low latency, and reliability. With Flink you can get all 3. Spark Streaming: Everything is treated as a batch, including streams. The stream of data from continuous events is broken into a series of small atomic batch jobs ("micro-batches"). If the batches are small enough, this can approximate true streaming, but latency won't be true real time. Can lead to fragile pipelines that mix DevOps with app development concerns. Flink: Everything is treated as a stream, with batch as a special case of streaming. Each event is processed separately, which keeps latency low.
  9. Real-world data set, no synthetic data Publicly available Shows how to approach a real-world problem!
  10. Log of all events that supports 'replays', i.e. random access to our event log (for window queries). Processing: want to have high throughput, exactly-once semantics, and low latency. Turn data into (perishable) insights. Visualization: to make these perishable insights consumable by human users.
  11. http://aws.amazon.com/kinesis Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale. Amazon Kinesis can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources, allowing you to easily write applications that process information in real-time, from sources such as web site click-streams, marketing and financial information, manufacturing instrumentation and social media, and operational logs and metering data.
  12. Dynamic port forwarding turns your SSH client into a SOCKS proxy server
  13. Elasticsearch drops TCP transport in v5 – Flink will have to adapt this anyway. Elastic is investing in their own HTTP Client; currently using Jest
  14. Spark encrypted shuffles are currently in the works, but for MapReduce there is a possibility of enabling an end-to-end encrypted data flow/store. There is also the ability to enable server-side encryption at the S3 bucket level, which is supported by EMRFS.
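
Notes 1 and 5 point out that session queries need windows whose duration is not known in advance, with the taxi-driver shift as the example. The sketch below illustrates that idea in plain Python. It is not Flink's API, and it processes a finished list rather than a live stream; the function name `session_windows`, the 60-minute gap, and the sample events are assumptions made up for illustration. Events are ordered by event time, so out-of-order arrival does not change the result.

```python
from collections import defaultdict


def session_windows(events, gap):
    """Group (key, event_time) pairs into per-key sessions.

    A session for a key closes when the next event for that key is more
    than `gap` time units after the previous one, so the session length
    is discovered from the data rather than fixed in advance.
    """
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)

    sessions = {}
    for key, stamps in by_key.items():
        stamps.sort()  # order by event time, not by arrival order
        windows = []
        start = prev = stamps[0]
        for ts in stamps[1:]:
            if ts - prev > gap:
                # Gap exceeded: close the current session, open a new one.
                windows.append((start, prev))
                start = ts
            prev = ts
        windows.append((start, prev))
        sessions[key] = windows
    return sessions


# Hypothetical events: (driver id, event time in minutes), arriving out of order.
events = [
    ("taxi-1", 0), ("taxi-1", 20), ("taxi-2", 5),
    ("taxi-1", 10), ("taxi-1", 300), ("taxi-1", 310),
]
shifts = session_windows(events, gap=60)
for driver, windows in sorted(shifts.items()):
    for start, end in windows:
        print(driver, "shift from", start, "to", end, "lasting", end - start)
```

With a fixed one-hour window, the two shifts of `taxi-1` could not be separated without knowing where each one starts and ends. A gap-based session finds those boundaries from the events themselves.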