DON'T CROSS THE STREAMS!
STREAMING AND APACHE FLINK
Senior Data Consultant
Dublin
JOHN GORMAN
amberhand
WHAT WE WILL COVER
It's all about pain!
Streaming and Related Terminology
Stream Processing Engines
Apache Flink
It started with a pain...a software pain
Things were big, slow & shaky....and getting worse!
The calm before the storm
Batch Processing (High Latency, inability to reason about
time)
Coupled systems prevented fast delivery of single change
requirements
Processing large distributed data
Messaging incorporated business logic (Service Bus)
Customers demanded immediate insight/action
Event Ordering/Timing, Consistency, Data Lineage
Lack of Fault Tolerant Systems
Someone noticed the need to change some time back...
Oh! The other Michael Hammer...
Ref: Michael Hammer - Harvard Business Review 1990
“We cannot achieve breakthroughs in
performance by cutting fat or automating
existing processes. Rather, we must
challenge old assumptions and shed the
old rules that made the business
underperform in the first place.”
Ref: Michael Hammer - Harvard Business Review 1990
“These rules of work design are based on
assumptions about technology, people,
and organisational goals that no longer
hold”
So...Software Legends set out to fix it...
THE PERFECT STORM
Elements of the "Perfect Storm"
Elements of the "Perfect Storm" contd.
Can something save us?
Streams!
WHAT ARE STREAMS?
Unbounded Events flowing from a Producer to a Consumer
Any event that happens internal or external to your company is fair game for inclusion in a stream!
Streaming obliterates old working habits rather than automating them
When did you last drop a DVD back to your video store?
Convenience of streaming films won out
Anyone using Dublin Bus still carry a timetable?
Realtime with Context is needed...
SOME OTHER COMMON STREAM EXAMPLES
Log files
User website clicks
Finance stocks
Social media streams
Ideal Stream Characteristics
Low Latency (Time required to produce some result)
High Throughput (Number of results produced in time)
Persisted for reuse
Fault Tolerant
Scalable Event Production (i.e. Partitioning)
Scalable Event Consumption (i.e. Consumer Groups)
Consumer manages state (offsets)
Handle Back Pressure
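The partitioning, consumer group and offset ideas above can be sketched in a few lines of plain Python. This is only an illustration, not a Kafka client: `PartitionedLog` and `Consumer` are hypothetical names, and the CRC32 key hashing is an assumption chosen to keep the example deterministic.

```python
import zlib


class PartitionedLog:
    """Toy append-only log split into partitions (illustrative only)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # The same key always lands in the same partition,
        # so per-key ordering is preserved while producers scale out.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def read(self, partition, offset, max_events=10):
        # Reading never removes events: the log is persisted for reuse.
        return self.partitions[partition][offset:offset + max_events]


class Consumer:
    """The consumer, not the log, remembers how far it has read."""

    def __init__(self, log, partitions):
        self.log = log
        self.offsets = {p: 0 for p in partitions}

    def poll(self):
        events = []
        for p in list(self.offsets):
            batch = self.log.read(p, self.offsets[p])
            self.offsets[p] += len(batch)
            events.extend(batch)
        return events


log = PartitionedLog(num_partitions=2)
for i in range(6):
    log.append(f"user-{i % 3}", f"click-{i}")

# Two consumers acting as one "group": each owns one partition.
c0 = Consumer(log, partitions=[0])
c1 = Consumer(log, partitions=[1])
seen = c0.poll() + c1.poll()
```

Because the offset lives with the consumer, rewinding is just a matter of setting `c0.offsets[0] = 0` and polling again.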
Benefits of streams
Ability to augment and enrich data streams
Duality of Streams and Tables (Only Streams Work)
Replay from a defined offset
Stream outputs can become stream inputs (unix pipes!)
Data first - Processing Later (Fast feature creation)
Stream your monitoring (Logs, Ops Metrics, Business KPI
etc.)
Benefits of streams contd.
Location in Time Testing (Bugs In Code)
Replication for Scale
Cross/Join prior unrelated sources (i.e. Time, Context -
Analytics)
Point of Record Stream (produce suitable Materialized
Views)
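A rough sketch of the stream/table duality and of replaying from an offset to build a materialized view. The `materialize` function and the convention that a `None` value means "delete" are assumptions made for this example, not part of any particular tool.

```python
def materialize(changelog, from_offset=0, view=None):
    """Fold (key, value) change events into a table; a None value deletes the key."""
    view = {} if view is None else dict(view)
    for key, value in changelog[from_offset:]:
        if value is None:
            view.pop(key, None)
        else:
            view[key] = value
    return view


changelog = [
    ("alice", 10),
    ("bob", 5),
    ("alice", 12),   # update
    ("bob", None),   # delete
    ("carol", 7),
]

# Replaying the whole stream rebuilds the table from scratch.
full_view = materialize(changelog)

# Replaying from offset 2 on top of an earlier snapshot gives the same table.
snapshot = materialize(changelog[:2])
resumed_view = materialize(changelog, from_offset=2, view=snapshot)
```

The stream stays the point of record; any number of differently shaped views can be derived from it later.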
MOST POPULAR STREAMING TOOLS
Apache Kafka
Amazon Kinesis - Based on Kafka Ideas
MapR Streams - Uses Kafka API (adds resilience features)
Can these Streams handle the load?
Apache Kafka Data Handling at LinkedIn
LinkedIn Engineering Blog March 20, 2015
We have the stream! Now what?
Enter the Stream Processing Engine
What is a Stream Processing Engine?
8 Requirements of a Real-Time Stream Processing Engine
(Michael Stonebraker)
1. Keep the data moving
2. Query using SQL on Stream
3. Handle Stream Imperfections (Delayed, Missing, Out-Of-Order Data)
4. Generate Predictable Outcomes
5. Integrate Stored and Streaming Data
6. Guarantee Data Safety and Availability
7. Partition and Scale Applications Automatically
8. Process and Respond Instantaneously
OK - Engines on... What can we do with it?
Stream Processing Engine - Use Cases
Lineage, Auditing, History (Immutable)
Internet of Things (Sensor data)
Realtime Monitoring (Failure Prevention)
Autonomous Cars
Fraud/Anomaly Detection
Health devices (fitbit, cardio pacemakers etc)
For System of record (Infinite persistence)
Digital Marketing
Network monitoring
Realtime pricing / analytics
Stream Processing Engine - Use Cases Contd...
Intelligence and Surveillance
Risk management (Realtime Asset Coverage)
E-commerce (Realtime customer retention)
Fraud detection (Card, Insurance)
Smart order routing
Transaction cost analysis
Pricing and analytics
Market data management
Algorithmic trading
Data warehouse augmentation
Streaming does not mandate BigData
Streaming does not mandate RealTime processing
...but many application types may mandate either or both
Ok great - Let's dig into an engine...
APACHE FLINK
Apache Flink Components
Apache Flink Architecture
Source: DataArtisans (BerlinBuzzwords 2016)
Job Manager UI - (For Job Submission & Monitoring)
Job Manager UI - (Plan and Scheduling)
WAIT! Let's clear a few things up...
Pipelining & Backpressure
Time Semantics (Event, Ingestion, Processing etc.)
Windows (count, rolling, session, custom)
Watermarks, Triggers (Inserted into stream)
Checkpoints (Async Recovery - Choice of state store
backend)
"Exactly Once" semantics (no need to question if fail on
send, process, return?)
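To make event time, windows and watermarks a little more concrete, here is a minimal plain-Python sketch. It is not Flink's API: `tumbling_window_counts` is a hypothetical function, and the watermark rule (largest event time seen minus an allowed out-of-orderness) is one simple strategy assumed for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per event-time window.

    events: iterable of (event_time, payload) in arrival order.
    A window is emitted once the watermark passes its end; events that
    arrive for an already-emitted window are collected as late.
    """
    open_windows = defaultdict(int)
    results, late = [], []
    watermark = float("-inf")
    for event_time, payload in events:
        window_start = event_time - (event_time % window_size)
        if window_start + window_size <= watermark:
            late.append((event_time, payload))
        else:
            open_windows[window_start] += 1
        # The watermark trails the largest event time seen so far.
        watermark = max(watermark, event_time - max_out_of_orderness)
        for start in sorted(open_windows):
            if start + window_size <= watermark:
                results.append((start, open_windows.pop(start)))
    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        results.append((start, open_windows.pop(start)))
    return results, late


arrivals = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (15, "e"), (3, "f"), (27, "g")]
results, late = tumbling_window_counts(arrivals, window_size=10, max_out_of_orderness=3)
# results == [(0, 3), (10, 2), (20, 1)], late == [(3, "f")]
```

Event `(8, "d")` arrives out of order but still lands in window 0 because the watermark has not yet passed 10; event `(3, "f")` arrives after that window was emitted and is treated as late.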
Apache Flink - Features out of the box!
Support for Event Time and Out-of-Order Events
Exactly-once Semantics for Stateful Computations
Highly flexible Streaming Windows & CEP
Continuous Streaming Model with Backpressure (Buffers)
Fault-tolerance via Lightweight Distributed Snapshots
One Runtime for Streaming and Batch Processing
Memory Management & Custom Serialization
Iterations and Delta Iterations
Program Optimizer
SQL (Batch and Streams) due soon in 1.1
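The idea behind snapshots and exactly-once state can be sketched as "checkpoint the input offset and the state together, then replay from that offset after a failure". Flink's real mechanism is distributed and asynchronous; the single-process `run` function below, with its `fail_at` and `checkpoint_every` parameters, is purely a hypothetical simplification.

```python
import copy


def run(events, state=None, start_offset=0, checkpoint_every=3, fail_at=None):
    """Sum values per key, snapshotting (offset, state) together every few events."""
    state = {} if state is None else state
    checkpoint = (start_offset, copy.deepcopy(state))
    for offset in range(start_offset, len(events)):
        if offset == fail_at:
            # Simulated crash: in-flight state is lost, only the checkpoint survives.
            return None, checkpoint
        key, value = events[offset]
        state[key] = state.get(key, 0) + value
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = (offset + 1, copy.deepcopy(state))
    return state, checkpoint


events = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5), ("b", 6), ("a", 7)]

# Failure-free run for comparison.
expected, _ = run(events)

# Crash while about to process offset 5; the last checkpoint was taken at offset 3.
_, (ckpt_offset, ckpt_state) = run(events, fail_at=5)

# Restore the state and replay the input from the checkpointed offset.
recovered, _ = run(events, state=ckpt_state, start_offset=ckpt_offset)
```

Events at offsets 3 and 4 are processed twice overall, but because the restored state predates them, each one is reflected in the final state exactly once.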
But I'm only here for the Machine Learning and Graph
Processing!!...
Machine Learning in Flink with FlinkML
* Apache Samoa Project - Streaming Machine Learning that works on top of Flink
** Apache Mahout - Batch based Machine Learning that works on top of Flink
Graph Processing in Flink?
"Gelly" is Apache Flink's Graph Analysis API
Iterative Graph processing abstractions on top of Flink
1. Vertex-Centric Iterations (like Pregel, Giraph)
2. Scatter-Gather Iterations
3. Gather-Sum-Apply (like PowerGraph)
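As a rough illustration of the vertex-centric model, the sketch below runs Single Source Shortest Path in supersteps: each vertex reads its incoming messages, updates its own value, and sends messages to its neighbours. This is plain Python rather than the Gelly API, and `vertex_centric_sssp` is a hypothetical name.

```python
import math


def vertex_centric_sssp(edges, source, max_supersteps=50):
    """edges: dict mapping vertex -> list of (neighbour, weight)."""
    vertices = set(edges) | {n for targets in edges.values() for n, _ in targets}
    distance = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}
    for _ in range(max_supersteps):
        if not inbox:
            break  # no messages in flight: the computation has converged
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                # The vertex improves its value and notifies its neighbours.
                distance[vertex] = best
                for neighbour, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox
    return distance


graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("D", 6.0)],
    "C": [("D", 1.0)],
}
distances = vertex_centric_sssp(graph, "A")
```

Only vertices that received a message do any work in a superstep, which is what makes this style of iteration cheap once most of the graph has settled.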
GELLY SUPPORTS
1. Graph Properties (numberOfVertices etc...)
2. Transformations (map, difference, join...)
3. Mutations (Add/Remove vertices/edges...)
4. Batch and Streams - Java, Scala
* External "Gradoop" Project adds further features on top of Flink
Graph Processing with Gelly - Algorithms
PageRank
Single Source Shortest Path
Label Propagation
Weakly Connected Components
Community Detection
Planned Algorithms
Triangle Count
HITS
Affinity Propagation
Graph Summarization
Planned Algorithms - Attribution: Vasia Kalavri
Ecosystem Integration
Data Source/Sinks via Connectors (Kafka, jdbc, S3, etc)
Storm and Cascading & MapReduce support
Machine Learning - Apache Samoa (Streaming ML), Apache Mahout (Batch)
Graph - Gradoop
Python API, Scala Repl, Apache Zeppelin Support
DataFlow Model - Apache Beam (API Abstraction + Flink
"Runner")
Apache Beam - Data Flow Model Support in Flink
Supported Distributions / Deployment Options
HortonWorks - Ambari Service (Confirmed full support on
the way)
Cloudera - Not Supported to my knowledge (Discussion
forums ref BigTop)
MapR - Not part of their MapR converged data platform
Amazon EMR (Yarn - Single Instance, Session)
Google Compute Engine (Yarn Support & Hosted
Competitor -> Cloud Dataflow)
Via Apache Myriad on Mesos (Native support coming in
1.2)
Some DataStream API Code (Setup)
* Code courtesy of DataArtisans on github
Some DataStream Code (Destination Sink & Running)
Sometimes, crossing the streams is the solution you need...
Crossing the streams with DataStream API
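The code from the original slide is not reproduced here. As a stand-in, this plain-Python sketch shows what a keyed, windowed join of two streams computes. It works on in-memory lists rather than live streams, and `windowed_join`, `orders` and `payments` are hypothetical names invented for the example.

```python
from collections import defaultdict


def windowed_join(left, right, window_size):
    """Join (event_time, key, value) events sharing a key and a tumbling window."""
    # (window_start, key) -> ([left values], [right values])
    buckets = defaultdict(lambda: ([], []))
    for side, stream in enumerate((left, right)):
        for event_time, key, value in stream:
            window_start = event_time - (event_time % window_size)
            buckets[(window_start, key)][side].append(value)
    joined = []
    for window_start, key in sorted(buckets):
        lefts, rights = buckets[(window_start, key)]
        for left_value in lefts:
            for right_value in rights:
                joined.append((window_start, key, left_value, right_value))
    return joined


orders = [(1, "u1", "order-1"), (7, "u2", "order-2"), (12, "u1", "order-3")]
payments = [(3, "u1", "pay-1"), (14, "u1", "pay-3"), (25, "u2", "pay-2")]
matches = windowed_join(orders, payments, window_size=10)
# matches == [(0, "u1", "order-1", "pay-1"), (10, "u1", "order-3", "pay-3")]
```

`order-2` and `pay-2` share a key but fall into different windows, so they do not join.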
Crossing the streams with CEP Library
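The slide's original code is not reproduced here. The sketch below only illustrates the kind of rule a CEP library expresses declaratively: "two over-threshold readings in a row for the same key, within a time bound". It is plain Python, not the Flink CEP API, and `detect_overheating` and the rack readings are hypothetical.

```python
def detect_overheating(events, threshold, within):
    """Alert when a key has two consecutive over-threshold readings within `within` time units.

    events: iterable of (event_time, key, temperature) in event-time order.
    """
    last_hot = {}  # key -> event_time of the previous over-threshold reading
    alerts = []
    for event_time, key, temperature in events:
        if temperature > threshold:
            previous = last_hot.get(key)
            if previous is not None and event_time - previous <= within:
                alerts.append((key, previous, event_time))
            last_hot[key] = event_time
        else:
            # A normal reading breaks the sequence for this key.
            last_hot.pop(key, None)
    return alerts


readings = [
    (1, "rack-1", 95), (2, "rack-2", 101), (3, "rack-1", 102),
    (5, "rack-1", 104), (6, "rack-2", 80), (9, "rack-2", 103),
    (30, "rack-1", 110),
]
alerts = detect_overheating(readings, threshold=100, within=10)
# alerts == [("rack-1", 3, 5)]
```

`rack-2` never alerts because a normal reading sits between its two hot ones, and the last `rack-1` reading is too far in time from the previous one.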
Proposed Flink 1.1 SQL API
* Code courtesy of DataArtisans on github
Flink Furthering Yahoo Benchmarks
Apache Flink Adoption
What's Next For Flink?
Queryable State (Database inversion! Kafka log, RocksDB)
Release of 1.1+
Dynamic Scaling, Resource Elasticity (i.e. for catchup)
Production Hardening (1,000 node cluster Alibaba)
Stream SQL (Apache Calcite)
CEP Enhancements (large sized async state snapshotting)
Mesos Support
More Connectors
API enhancements (joins, slowly changing inputs)
Security (data encryption, Kerberos with Kafka)
Email: john.gorman@amberhand.ie
LinkedIn: johnpgorman
THANK YOU
ACKNOWLEDGEMENTS
Bank Of Ireland - Event and Venue
Hadoop User Group Ireland - Community Building
Data Artisans - Images, Code and Community Support
Anne Ebeling - Dublin Artwork
RESOURCES
APACHE FLINK TRAINING
APACHE FLINK TAXI STREAM EXAMPLE
BACK PRESSURE IN FLINK
CEP MONITORING - CEP SAMPLE
RUNNING FLINK ON YARN
STREAMING 101 BY TYLER AKIDAU
STREAMING 102 BY TYLER AKIDAU
MAPR FREE EBOOK ON STREAMING ARCHITECTURE
