Don't Cross The Streams - Data Streaming And Apache Flink

Alongside the arrival of big data, a parallel, less well-known but significant change has occurred in the way we process data: data is getting faster! Business models are changing radically based on the ability to be first to gain an insight and act on it, whether that means keeping the customer, preventing the breakdown or saving the patient. In essence, knowing something now is overriding knowing everything later. Stream processing engines allow us to blend event streams from different internal and external sources to gain insights in real time. This talk discusses the need for streaming, the business models it can change, the new applications it allows and why Apache Flink enables these applications. Apache Flink is a top-level Apache project for real-time stream processing at scale. It is a high-throughput, low-latency, fault-tolerant, distributed, stateful stream processing engine. Flink offers polyglot APIs (Scala, Python, Java) for manipulating streams, a complex event processing (CEP) library for monitoring and alerting on the streams, and integration points with other big data ecosystem tooling.
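As a rough illustration of what "blending event streams" can mean, the sketch below merges two time-ordered streams and pairs each purchase with the same user's most recent click, provided the click happened within a maximum gap. It is plain Python rather than Flink code, and every name in it (join_within, the clicks and purchases tuples, max_gap) is a hypothetical chosen for illustration. It also assumes each input stream is already sorted by timestamp, which a real engine cannot take for granted.

```python
import heapq


def join_within(clicks, purchases, max_gap):
    """Pair each purchase with the same user's latest click within max_gap.

    clicks:    iterable of (timestamp, user, page), sorted by timestamp
    purchases: iterable of (timestamp, user, amount), sorted by timestamp
    Returns a list of (user, page, amount, gap) tuples.
    """
    # Tag each event with its source so the merged stream can be told apart.
    tagged_clicks = ((ts, "click", user, page) for ts, user, page in clicks)
    tagged_purchases = ((ts, "purchase", user, amt) for ts, user, amt in purchases)

    last_click = {}  # keyed state: user -> (timestamp, page)
    matches = []

    # heapq.merge lazily interleaves the two sorted streams by timestamp.
    merged = heapq.merge(tagged_clicks, tagged_purchases, key=lambda e: e[0])
    for ts, source, user, payload in merged:
        if source == "click":
            last_click[user] = (ts, payload)
        else:
            click = last_click.get(user)
            if click is not None and ts - click[0] <= max_gap:
                matches.append((user, click[1], payload, ts - click[0]))
    return matches


clicks = [(1, "ann", "/home"), (3, "bob", "/shoes"), (8, "ann", "/hats")]
purchases = [(4, "bob", 20.0), (9, "ann", 15.0), (30, "bob", 5.0)]
matches = join_within(clicks, purchases, max_gap=5)
```

The dictionary plays the role of per-key state. In a distributed engine that state would have to be partitioned by key and checkpointed so that it survives failures.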

Published in: Technology

  1. 1. DON'T CROSS THE STREAMS! STREAMING AND APACHE FLINK
  2. 2. JOHN GORMAN, Senior Data Consultant, amberhand, Dublin
  3. 3. WHAT WE WILL COVER It's all about pain! Streaming and Related Terminology Stream Processing Engines Apache Flink
  4. 4. It started with a pain...a software pain
  5. 5. Things were big, slow & shaky....and getting worse!
  6. 6. The calm before the storm Batch Processing (High Latency, inability to reason about time) Coupled systems prevented fast delivery of single change requirements Processing large distributed data Messaging incorporated business logic (Service Bus) Customers demanded immediate insight/action Event Ordering/Timing, Consistency, Data Lineage Lack of Fault Tolerant Systems
  7. 7. Someone noticed the need to change some time back...
  8. 8. Oh! The other Michael Hammer...
  9. 9. Ref: Michael Hammer - Harvard Business Review 1990 “We cannot achieve breakthroughs in performance by cutting fat or automating existing processes. Rather, we must challenge old assumptions and shed the old rules that made the business underperform in the first place.”
  10. 10. Ref: Michael Hammer - Harvard Business Review 1990 “These rules of work design are based on assumptions about technology, people, and organisational goals that no longer hold”
  11. 11. So...Software Legends set out to fix it...
  12. 12. THE PERFECT STORM
  13. 13. Elements of the "Perfect Storm"
  14. 14. Elements of the "Perfect Storm" contd.
  15. 15. Can something save us?
  16. 16. Streams!
  17. 17. WHAT ARE STREAMS? Unbounded Events flowing from a Producer to a Consumer. Any event that happens internal or external to your company is fair game for inclusion in a stream!
  18. 18. Streaming obliterates old working habits rather than automating them
  19. 19. When did you last drop a DVD back to your video store ? Convenience of streaming films won out
  20. 20. Anyone using Dublin Bus still carry a timetable? Realtime with Context is needed...
  21. 21. SOME OTHER COMMON STREAM EXAMPLES Log files User website clicks, Finance stocks Social media streams
  22. 22. Ideal Stream Characteristics Low Latency (Time required to produce some result) High Throughput (Number of results produced in time) Persisted for reuse Fault Tolerant Scalable Event Production (i.e. Partitioning) Scalable Event Consumption (i.e. Consumer Groups) Consumer manages state (offsets) Handle Back Pressure
  23. 23. Benefits of streams Ability to augment and enrich data streams Duality of Streams and Tables (Only Streams Work) Replay from a defined offset Stream outputs can become stream inputs (unix pipes!) Data first - Processing Later (Fast feature creation) Stream your monitoring (Logs, Ops Metrics, Business KPI etc.)
  24. 24. Benefits of streams contd. Location in Time Testing (Bugs In Code) Replication for Scale Cross/Join prior unrelated sources (i.e. Time, Context - Analytics) Point of Record Stream (produce suitable Materialized Views)
  25. 25. MOST POPULAR STREAMING TOOLS Apache Kafka Amazon Kinesis - Based on Kafka Ideas MapR Streams - Uses Kafka API (adds resilience features)
  26. 26. Can these Streams handle the load ?
  27. 27. Apache Kafka Data Handling at LinkedIn LinkedIn Engineering Blog March 20, 2015
  28. 28. We have the stream! Now what?
  29. 29. Enter the Stream Processing Engine
  30. 30. What is a Stream Processing Engine ?
  31. 31. 8 Requirements of a Real-Time Stream Processing Engine (Michael Stonebraker) 1. Keep the data moving 2. Query using SQL on Stream 3. Handle Stream Imperfections (Delayed, Missing, Out-of-Order Data) 4. Generate Predictable Outcomes 5. Integrate Stored and Streaming Data 6. Guarantee Data Safety and Availability 7. Partition and Scale Applications Automatically 8. Process and Respond Instantaneously
  32. 32. OK - Engines on... What can we do with it ?
  33. 33. Stream Processing Engine - Use Cases Lineage, Auditing, History (Immutable) Internet of Things (Sensor data) Realtime Monitoring (Failure Prevention) Autonomous Cars Fraud/Anomaly Detection Health devices (Fitbit, cardio pacemakers etc) For System of record (Infinite persistence) Digital Marketing Network monitoring Realtime pricing / analytics
  34. 34. Stream Processing Engine - Use Cases Contd... Intelligence and Surveillance Risk management (Realtime Asset Coverage) E-commerce (Realtime customer retention) Fraud detection (Card, Insurance) Smart order routing Transaction cost analysis Pricing and analytics Market data management Algorithmic trading Data warehouse augmentation
  35. 35. Streaming does not mandate BigData Streaming does not mandate RealTime processing ...but many application types may mandate either or both
  36. 36. Ok great - Let's dig into an engine...
  37. 37. APACHE FLINK
  38. 38. Apache Flink Components
  39. 39. Apache Flink Architecture Source: DataArtisans (BerlinBuzzwords 2016)
  40. 40. Job Manager UI - (For Job Submission & Monitoring)
  41. 41. Job Manager UI - (Plan and Scheduling)
  42. 42. WAIT! Let's clear a few things up... Pipelining & Backpressure Time Semantics (Event, Ingestion, Processing etc.) Windows (count, rolling, session, custom) Watermarks, Triggers (Inserted into stream) Checkpoints (Async Recovery - Choice of state store backend) "Exactly Once" semantics (no need to question if fail on send, process, return?)
  43. 43. Apache Flink - Features out of the box! Support for Event Time and Out-of-Order Events Exactly-once Semantics for Stateful Computations Highly flexible Streaming Windows & CEP Continuous Streaming Model with Backpressure (Buffers) Fault-tolerance via Lightweight Distributed Snapshots One Runtime for Streaming and Batch Processing Memory Management & Custom Serialization Iterations and Delta Iterations Program Optimizer SQL (Batch and Streams) due soon in 1.1
  44. 44. But I'm only here for the Machine Learning and Graph Processing!!...
  45. 45. Machine Learning in Flink with FlinkML * Apache Samoa Project - Streaming Machine Learning that works on top of Flink ** Apache Mahout - Batch based Machine Learning that works on top of Flink
  46. 46. Graph Processing in Flink?
  47. 47. "Gelly" is Apache Flink's Graph Analysis API Iterative Graph processing abstractions on top of Flink 1. Vertex-Centric Iterations (like Pregel, Giraph) 2. Scatter-Gather Iterations 3. Gather-Sum-Apply (like PowerGraph)
  48. 48. GELLY SUPPORTS 1. Graph Properties (numberOfVertices etc...) 2. Transformations (map, difference, join...) 3. Mutations (Add/Remove vertices/edges...) 4. Batch and Streams - Java, Scala * External "Gradoop" Project adds further features on top of Flink
  49. 49. Graph Processing with Gelly - Algorithms PageRank Single Source Shortest Path Label Propagation Weakly Connected Components Community Detection
  50. 50. Planned Algorithms Triangle Count HITS Affinity Propagation Graph Summarization Planned Algorithms - Attribution: Vasia Kalavri
  51. 51. Ecosystem Integration Data Source/Sinks via Connectors (Kafka, JDBC, S3, etc) Storm and Cascading & MapReduce support Machine Learning - Apache Samoa (Streaming ML), Apache Mahout (Batch) Graph - Gradoop Python API, Scala REPL, Apache Zeppelin Support DataFlow Model - Apache Beam (API Abstraction + Flink "Runner")
  52. 52. Apache Beam - Data Flow Model Support in Flink
  53. 53. Supported Distributions / Deployment Options HortonWorks - Ambari Service (Confirmed full support on the way) Cloudera - Not Supported to my knowledge (Discussion forums ref BigTop) MapR - Not part of their MapR converged data platform Amazon EMR (Yarn - Single Instance, Session) Google Compute Engine (Yarn Support & Hosted Competitor -> Cloud Dataflow) Via Apache Myriad on Mesos (Native support coming in 1.2)
  54. 54. Some DataStream API Code (Setup) * Code courtesy of DataArtisans on github
  55. 55. Some DataStream Code (Destination Sink & Running)
  56. 56. Sometimes, crossing the streams is the solution you need...
  57. 57. Crossing the streams with DataStream API
  58. 58. Crossing the streams with CEP Library
  59. 59. Proposed Flink 1.1 SQL API * Code courtesy of DataArtisans on github
  60. 60. Flink Furthering Yahoo Benchmarks
  61. 61. Apache Flink Adoption
  62. 62. What's Next For Flink? Queryable State (Database inversion! Kafka log, RocksDB) Release of 1.1+ Dynamic Scaling, Resource Elasticity (i.e. for catchup) Production Hardening (1,000 node cluster Alibaba) Stream SQL (Apache Calcite) CEP Enhancements (large sized async state snapshotting) Mesos Support More Connectors API enhancements (joins, slowly changing inputs) Security (data encryption, Kerberos with Kafka)
  63. 63. Email: john.gorman@amberhand.ie LinkedIn: johnpgorman THANK YOU ACKNOWLEDGEMENTS Bank Of Ireland - Event and Venue Hadoop User Group Ireland - Community Building Data Artisans - Images, Code and Community Support Anne Ebeling - Dublin Artwork
  64. 64. RESOURCES: APACHE FLINK; APACHE FLINK TRAINING; TAXI STREAM EXAMPLE; BACK PRESSURE IN FLINK; CEP MONITORING; CEP SAMPLE; RUNNING FLINK ON YARN; STREAMING 101 BY TYLER AKIDAU; STREAMING 102 BY TYLER AKIDAU; MAPR FREE EBOOK ON STREAMING ARCHITECTURE
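
The event-time, window and watermark ideas listed on slide 42 can be sketched in a few lines of plain Python. This is not Flink's API and simplifies its model: the function name and its parameters (tumbling_window_counts, window_size, max_out_of_orderness) are hypothetical, the watermark is assumed to trail the largest timestamp seen by a fixed bound, and a window is emitted once the watermark reaches its end. Events that arrive after their window has been emitted are collected as late rather than counted.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per key in event-time tumbling windows.

    events: iterable of (event_time, key) in arrival order, possibly
            out of order with respect to event_time.
    Returns (results, late), where results is a list of
    (window_start, key, count) and late lists the dropped events.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # start -> key -> count
    watermark = float("-inf")
    results, late = [], []

    for event_time, key in events:
        window_start = event_time - (event_time % window_size)

        # The window for this event was already emitted: treat it as late.
        if window_start + window_size <= watermark:
            late.append((event_time, key))
            continue

        open_windows[window_start][key] += 1

        # Assumed watermark rule: largest timestamp seen minus a fixed bound.
        watermark = max(watermark, event_time - max_out_of_orderness)

        # Emit every window whose end the watermark has reached.
        ready = sorted(s for s in open_windows if s + window_size <= watermark)
        for start in ready:
            for k, count in sorted(open_windows.pop(start).items()):
                results.append((start, k, count))

    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        for k, count in sorted(open_windows[start].items()):
            results.append((start, k, count))
    return results, late


events = [(1, "a"), (4, "b"), (12, "a"), (3, "a"), (17, "a"), (2, "b"), (25, "b")]
results, late = tumbling_window_counts(events, window_size=10, max_out_of_orderness=5)
```

In this run the event (3, "a") arrives out of order but before the watermark reaches the end of its window, so it is still counted. The event (2, "b") arrives after that window has been emitted and ends up in the late list.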
