Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

JUG Tirana - Introduction to data streaming

This talk will be about the reasons behind the new stream processing model, how it compare to the old batch model, what are their pros and cons, and a list of existing technologies implementing stream processing with their most prominent characteristics. It will contain details of one possible use-case of data streaming that is not possible with batches: display in (near) real-time all trains in Switzerland and their position on a map, beginning with an overview of all the requirements and the design. Finally, using an OpenData endpoint and the Hazelcast platform,showing a working demo implementation of it.

  • Be the first to comment

  • Be the first to like this

JUG Tirana - Introduction to data streaming

  1. 1. @nicolas_frankel A gentle introduction to Stream Processing Nicolas Fränkel
  2. 2. @nicolas_frankel Me, myself and I  Former developer, team lead, architect, blah-blah  Developer Advocate  Lots of crazy creative ideas
  3. 3. @nicolas_frankel Hazelcast HAZELCAST IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage and performs execution for breakthrough and scale. HAZELCAST JET is the ultra fast, application embeddable, 3rd generation stream processing engine for low latency batch and stream processing.
  4. 4. @nicolas_frankel Schedule  Why streaming?  Streaming approaches  Hazelcast Jet  Open Data  General Transit Feed Specification  The demo!  Q&A
  5. 5. @nicolas_frankel In a time before our time… Data was neatly stored in SQL databases
  6. 6. @nicolas_frankel The need for Extract Transform Load  Analytics • Supermarket sales in the last hour?  Reporting • Banking account annual closing
  7. 7. @nicolas_frankel What SQL really means  Constraints  Joints  Normal forms
  8. 8. @nicolas_frankel Writes vs. reads  Normalized vs. denormalized  Correct vs. fast
  9. 9. @nicolas_frankel The need for ETL  Different actors  With different needs  Using the same database?
  10. 10. @nicolas_frankel The batch model 1. Extract 2. Transform 3. Load
  11. 11. @nicolas_frankel Batches are everywhere!
  12. 12. @nicolas_frankel Properties of batches  Scheduled at regular intervals • Daily • Weekly • Monthly • Yearly  Run in a specific amount of time
  13. 13. @nicolas_frankel Oops  When the execution time overlaps the next execution schedule  When the space taken by the data exceeds the storage capacity  When the batch fails mid-execution  etc.
  14. 14. @nicolas_frankel Big data!  Parallelize everything • Map - Reduce • Hadoop  NoSQL • Schema on Read vs. Schema on Write
  15. 15. @nicolas_frankel Or chunking?  Keep a cursor • And only manage “chunks” of data  What about new data coming in?
  16. 16. @nicolas_frankel Event-Driven Programming “Programming paradigm in which the flow of the program is determined by events such as user actions (mouse clicks, key presses), sensor outputs, or messages from other programs or threads” -- Wikipedia
  17. 17. @nicolas_frankel Event Sourcing “Event sourcing persists the state of a business entity such an Order or a Customer as a sequence of state- changing events. Whenever the state of a business entity changes, a new event is appended to the list of events. Since saving an event is a single operation, it is inherently atomic. The application reconstructs an entity’s current state by replaying the events.” --
  18. 18. @nicolas_frankel Database internals  Ordered append-only log • e.g. MySQL binlog
  19. 19. @nicolas_frankel Make everything event-based!
  20. 20. @nicolas_frankel Benefits  Memory-friendly  Easily processed  Pull vs. push • Very close to real-time • Keeps derived data in-sync
  21. 21. @nicolas_frankel From finite datasets to infinite
  22. 22. @nicolas_frankel Streaming is smart ETL Processing Ingest In-Memory Operational Storage Combine Join, Enrich, Group, Aggregate Stream Windowing, Event-Time Processing Compute Distributed and Parallel Computation Transform Filter, Clean, Convert Publish In-Memory, Subscriber Notifications Notify if response time is 10% over 24 hour average, second by second
  23. 23. @nicolas_frankel Use Case: Analytics and Decision Making  Real-time dashboards • Decision making • Recommendations  Stats (gaming, infrastructure monitoring)  Prediction - often based on algorithmic prediction • Push stream through ML model  Complex Event Processing
  24. 24. @nicolas_frankel Persistent event-storage systems  Kafka  Pulsar
  25. 25. @nicolas_frankel Kafka  Distributed  On-disk storage  Messages sent and read from a topic • Publish-subscribe • Queue  Consumer can keep track of the offset
  26. 26. @nicolas_frankel In-memory stream processing engines  Apache Flink  Amazon Kinesis  IBM Streams  Hazelcast Jet  Apache Beam • Abstraction over some of the above  …
  27. 27. @nicolas_frankel Hazelcast Jet  Apache 2 Open Source  Single JAR  Leverages Hazelcast IMDG  Unified batch/streaming API  (Hazelcast Jet Enterprise)
  28. 28. @nicolas_frankel Hazelcast Jet
  29. 29. @nicolas_frankel Concept: Pipeline  Declaration (code) that defines and links sources, transforms, and sinks  Platform-specific SDK (Pipeline API in Jet)  Client submits pipeline to the Stream Processing Engine (SPE)
  30. 30. @nicolas_frankel Concept: Job  Running instance of pipeline in SPE  SPE executes the pipeline  Code execution  Data routing  Flow control  Parallel and distributed execution
  31. 31. @nicolas_frankel Imperative model final String text = "..."; final Map<String, Long> counts = new HashMap<>(); for (String word : text.split("W+")) { Long count = counts.get(word); counts.put(count == null ? 1L : count + 1); }
  32. 32. @nicolas_frankel Declarative model Map<String, Long> counts = .map(String::toLowerCase) .flatMap( line ->"W+")) ) .filter(word -> !word.isEmpty()) .collect(Collectors.groupingBy( word -> word, Collectors.counting()) );
  33. 33. @nicolas_frankel What Distributed Means to Hazelcast  Multiple nodes  Scalable storage and performance  Elasticity  Data stored, partitioned and replicated  No single point of failure
  34. 34. @nicolas_frankel Distributed Parallel Processing Pipeline p = Pipeline.create(); p.drawFrom(Sources.<Long, String>map(BOOK_LINES)) .flatMap(line -> traverseArray(line.getValue().split("W+"))) .filter(word -> !word.isEmpty()) .groupingKey(wholeItem()) .aggregate(counting()) .drainTo(; Data Sink Data Source from aggrmap filter to Translate declarative code to a Directed Acyclic Graph
  35. 35. @nicolas_frankel Node 1 Distributed Parallel Processing read cmb map + filter acc sink read cmb map + filter acc Node 2 read cmb map + filter acc sinkread cmb map + filter acc Data Source Data Sink sink sink
  36. 36. @nicolas_frankel Open Data « Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. » --
  37. 37. @nicolas_frankel Some Open Data initiatives  France: •  Switzerland: •  European Union: •
  38. 38. @nicolas_frankel Challenges 1. Access 2. Format 3. Standard 4. Data correctness
  39. 39. @nicolas_frankel Access  Download a file  Access it interactively through a web- service
  40. 40. @nicolas_frankel Format In general, Open Data means Open Format  PDF  CSV  XML  JSON  etc.
  41. 41. @nicolas_frankel Standard  Let’s pretend the format is XML • Which grammar is used?  A shared standard is required • Congruent to a domain
  42. 42. @nicolas_frankel Data correctness "32.TA.66-43","16:20:00","16:20:00","8504304" "32.TA.66-44","24:53:00","24:53:00","8500100" "32.TA.66-44","25:00:00","25:00:00","8500162" "32.TA.66-44","25:02:00","25:02:00","8500170" "32.TA.66-45","23:32:00","23:32:00","8500170"
  43. 43. @nicolas_frankel General Transit Feed Specification ”The General Transit Feed Specification (GTFS) […] defines a common format for public transportation schedules and associated geographic information. GTFS feeds let public transit agencies publish their transit data and developers write applications that consume that data in an interoperable way.”
  44. 44. @nicolas_frankel GTFS static model Filename Required Defines agency.txt Required Transit agencies with service represented in this dataset. stops.txt Required Stops where vehicles pick up or drop off riders. Also defines stations and station entrances. routes.txt Required Transit routes. A route is a group of trips that are displayed to riders as a single service. trips.txt Required Trips for each route. A trip is a sequence of two or more stops that occur during a specific time period. stop_times.txt Required Times that a vehicle arrives at and departs from stops for each trip. calendar.txt Conditionally required Service dates specified using a weekly schedule with start and end dates. This file is required unless all dates of service are defined in calendar_dates.txt. calendar_dates.txt Conditionally required Exceptions for the services defined in the calendar.txt. If calendar.txt is omitted, then calendar_dates.txt is required and must contain all dates of service. fare_attributes.txt Optional Fare information for a transit agency's routes.
  45. 45. @nicolas_frankel GTFS static model Filename Required Defines fare_rules.txt Optional Rules to apply fares for itineraries. shapes.txt Optional Rules for mapping vehicle travel paths, sometimes referred to as route alignments. frequencies.txt Optional Headway (time between trips) for headway-based service or a compressed representation of fixed-schedule service. transfers.txt Optional Rules for making connections at transfer points between routes. pathways.txt Optional Pathways linking together locations within stations. levels.txt Optional Levels within stations. feed_info.txt Optional Dataset metadata, including publisher, version, and expiration information. translations.txt Optional Translated information of a transit agency. attributions.txt Optional Specifies the attributions that are applied to the dataset.
  46. 46. @nicolas_frankel GTFS dynamic model
  47. 47. @nicolas_frankel Use-case: Swiss Public Transport  Open Data  GTFS static available as downloadable .txt files  GTFS dynamic available as a REST endpoint
  48. 48. @nicolas_frankel The available… … data model Where’s the position?!
  49. 49. @nicolas_frankel The dynamic data pipeline 1. Source: web service 2. Split into trip updates 3. Enrich with trip data 4. Enrich with stop times data 5. Transform hours into timestamp 6. Enrich with location data 7. Sink: Hazelcast IMDG
  50. 50. @nicolas_frankel Architecture overview
  51. 51. @nicolas_frankel
  52. 52. @nicolas_frankel Recap  Streaming has a lot of benefits  Leverage Open Data  It’s the Wild West out there • No standards • Real-world data sucks!  But you can get cool stuff done
  53. 53. @nicolas_frankel Thanks a lot!   @nicolas_frankel   