
Data Stream Analytics - Why they are important

Streaming is cool and it can help us do quick analytics and make a profit, but what about tsunamis? This is a motivation talk presented at the SeRC Big Data Workshop in Sweden during spring 2016. It motivates the streaming paradigm and provides examples using Apache Flink.

  1. 1. An Introduction to Data Stream Analytics using Apache Flink SeRC Big Data Workshop Paris Carbone <parisc@kth.se> PhD Candidate KTH Royal Institute of Technology 1
  2. 2. Motivation • Time-critical problems / Actionable Insights • Stock market predictions • Fraud detection • Network security • Fresh customer recommendations 2 more like First-World Problems..
  3. 3. How about Tsunamis? 3
  4. 4. 4 [Diagram: deploy sensors, collect data, analyse data regularly; earth & wave activity; evacuation window]
  5. 5. Motivation 5 [Diagram: repeated query evaluations over collected data]
  6. 6. Motivation 6 [Diagram: a standing query Q continuously computing the evacuation window]
  7. 7. Data Stream Paradigm • Standing queries are evaluated continuously • Input data is unbounded • Queries operate on the full data stream or on the most recent views of the stream ~ windows 7
  8. 8. Data Stream Basics • Events/Tuples: elements of computation - respect a schema • Data Streams: unbounded sequences of events • Stream Operators: consume streams and generate new ones. • Events are consumed once - no backtracking! 8 [Diagram: an operator f consumes streams S1, S2 and produces new streams S’1, S’2]
  9. 9. Streaming Pipelines 9 [Diagram: sources (stream1, stream2, …) feed a standing query Q whose sinks emit approximations, predictions, alerts, …]
  10. 10. Stream Analytics Systems 10 Proprietary: Google DataFlow, IBM Infosphere, Microsoft Azure. Open Source: Flink, Storm, Samza, Spark.
  11. 11. Programming Models 11 Compositional: offer basic building blocks for composing custom operators and topologies; advanced behaviour such as windowing is often missing; custom optimisation. Declarative: expose a high-level API; operators are transformations on abstract data types; advanced behaviour such as windowing is supported; self-optimisation.
  12. 12. Introducing Apache Flink [Chart: #unique contributor ids by git commits, July 2009 – May 2016] • A top-level Apache project • Community-driven open source software development • Publicly open to new contributors
  13. 13. Native Workload Support Apache Flink: Stream Pipelines, Batch Pipelines, Scalable Machine Learning, Graph Analytics
  14. 14. 14 The Apache Flink Stack: APIs (DataSet, DataStream), Execution (Distributed Dataflow), Deployment. DataSet: bounded data sources, blocking operations, structured iterations. DataStream: unbounded data sources, continuous operations, asynchronous iterations.
  15. 15. The Big Picture [Stack diagram: Deployment, Distributed Dataflow, DataSet and DataStream APIs, with libraries such as Graph (Gelly), Table, ML, Hadoop M/R, CEP and SQL on top]
  16. 16. 16 Basic API Concept [Diagram: Source → Data Stream → Operator → Data Stream → Sink, and Source → Data Set → Operator → Data Set → Sink] Writing a Flink Program: 1. Bootstrap sources 2. Apply operators 3. Output to sinks (see the sketch below)
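
To make the three steps concrete, here is a minimal sketch of a Flink DataStream program in Scala; the socket source, host and port are illustrative stand-ins for whatever source a real pipeline reads from.

    import org.apache.flink.streaming.api.scala._

    object MinimalPipeline {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // 1. Bootstrap a source (illustrative socket text source)
        val lines: DataStream[String] = env.socketTextStream("localhost", 9999)

        // 2. Apply operators
        val upper: DataStream[String] = lines.map(_.toUpperCase)

        // 3. Output to a sink
        upper.print()

        env.execute("minimal pipeline sketch")
      }
    }
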
  17. 17. Data Streams as Abstract Data Types • Tasks are distributed and run in a pipelined fashion. • State is kept within tasks. • Transformations are applied per-record or window. • Transformations: map, flatmap, filter, union… • Aggregations: reduce, fold, sum • Partitioning: forward, broadcast, shuffle, keyBy • Sources/Sinks: custom or Kafka, Twitter, Collections… 17 DataStream
  18. 18. Example 18 textStream .flatMap {_.split("\\W+")} .map {(_, 1)} .keyBy(0) .sum(1) .print() “live and let live” “live” “and” “let” “live” (live,1) (and,1) (let,1) (live,1) (live,1) (and,1) (let,1) (live,2)
  19. 19. Working with Windows 19 Why windows? We are often interested in fresh data! Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events! 1) Sliding windows: myKeyedStream.timeWindow(Time.seconds(60), Time.seconds(20)); 2) Tumbling windows: myKeyedStream.timeWindow(Time.seconds(60)); [Diagram: window buckets/panes formed over an event timeline]
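
To sketch how the window calls above combine with event time and late events, the snippet below assigns timestamps from a hypothetical Reading record and tolerates events that arrive up to ten seconds out of order; the record type, the bound and the sample data are illustrative, not from the talk.

    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    // Hypothetical event type carrying its own timestamp
    case class Reading(sensorId: String, value: Double, timestamp: Long)

    object EventTimeWindowSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

        val readings: DataStream[Reading] = env.fromElements(
          Reading("s1", 15.0, 1000L), Reading("s1", 65.0, 4000L), Reading("s1", 38.0, 2500L))

        readings
          // derive event time from each record, allowing up to 10 s of out-of-order arrival
          .assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor[Reading](Time.seconds(10)) {
              override def extractTimestamp(r: Reading): Long = r.timestamp
            })
          .keyBy(_.sensorId)
          .timeWindow(Time.seconds(60), Time.seconds(20)) // sliding windows, as on the slide
          .reduce((a, b) => Reading(a.sensorId, a.value + b.value, math.max(a.timestamp, b.timestamp)))
          .print()

        env.execute("event-time window sketch")
      }
    }
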
  20. 20. Example 20 textStream .flatMap {_.split("\\W+")} .map {(_, 1)} .keyBy(0) .timeWindow(Time.minutes(5)) .sum(1) .print() “live and” (live,1) (and,1) (let,1) (live,1) counting words over windows “let live” 10:48 11:01 Window (10:45-10:50) Window (11:00-11:05)
  21. 21. Example 21 [Dataflow: flatMap → map → window sum → print; annotation: where counts are kept in state] textStream .flatMap {_.split("\\W+")} .map {(_, 1)} .keyBy(0) .timeWindow(Time.minutes(5)) .sum(1) .print()
  22. 22. Example 22 [Dataflow: flatMap → map → window sum (parallelism 4) → print] textStream .flatMap {_.split("\\W+")} .map {(_, 1)} .keyBy(0) .timeWindow(Time.minutes(5)) .sum(1) .setParallelism(4) .print()
  23. 23. Making State Explicit 23 • Explicitly defined state is durable across failures • Flink supports two types of explicit state • Operator State - full state • Key-Value State - partitioned state per key • State Backends: In-memory, RocksDB, HDFS
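
A minimal sketch of the Key-Value (partitioned) flavour, assuming a keyed stream of (word, count) pairs; the class, state name and usage are hypothetical, and the exact ValueStateDescriptor constructors differ slightly between Flink releases.

    import org.apache.flink.api.common.functions.RichFlatMapFunction
    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.util.Collector

    // Keeps a per-key running total in explicitly declared key-value state,
    // so it is covered by Flink's snapshots and restored after a failure.
    class RunningCount extends RichFlatMapFunction[(String, Long), (String, Long)] {

      @transient private var total: ValueState[java.lang.Long] = _

      override def open(parameters: Configuration): Unit = {
        total = getRuntimeContext.getState(
          new ValueStateDescriptor[java.lang.Long]("total", classOf[java.lang.Long]))
      }

      override def flatMap(in: (String, Long), out: Collector[(String, Long)]): Unit = {
        val previous = Option(total.value()).map(_.longValue()).getOrElse(0L)
        val updated = previous + in._2
        total.update(updated)          // mutate the partitioned state in place
        out.collect((in._1, updated))
      }
    }

    // e.g. wordCounts.keyBy(0).flatMap(new RunningCount)
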
  24. 24. Fault Tolerance 24 [Diagram: snapshots taken at t1 and t2 while events keep flowing] State is not affected by failures. When failures occur we revert computation and state back to a snapshot. Also part of Apache Storm.
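
Snapshotting has to be switched on per job; a minimal sketch, with an illustrative interval:

    import org.apache.flink.streaming.api.scala._

    object CheckpointingSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // Snapshot all operator and key-value state every 10 seconds;
        // after a failure the job resumes from the latest completed snapshot.
        env.enableCheckpointing(10000)
        // ... define sources, operators and sinks as usual, then call env.execute(...)
      }
    }
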
  25. 25. Performance • Twitter Hack Week - Flink as an in-memory data store 25 Jamie Grier - http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
  26. 26. So how is Flink different from Spark? 26 Two major differences: 1) Stream Execution 2) Mutable State
  27. 27. Flink vs Spark 27 Spark Streaming: dstream.updateStateByKey(…) puts new states in an output RDD [Diagram: state S transformed into a new state S’ per batch]. Flink: dedicated resources, mutable state. Spark: leased resources, immutable state.
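
For contrast, a rough Spark Streaming sketch of the updateStateByKey style mentioned above, where every micro-batch derives a new, immutable state RDD from the previous one; the master, batch interval, socket source and checkpoint directory are illustrative.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SparkRunningCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("count").setMaster("local[2]") // local run for illustration
        val ssc = new StreamingContext(conf, Seconds(5))
        ssc.checkpoint("/tmp/spark-checkpoints") // required by updateStateByKey

        val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\W+"))
        val pairs = words.map((_, 1L))

        // Each batch produces a *new* state RDD from the old one (immutable state),
        // whereas Flink updates its operator/key-value state in place.
        val counts = pairs.updateStateByKey[Long] { (batch: Seq[Long], state: Option[Long]) =>
          Some(state.getOrElse(0L) + batch.sum)
        }

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }
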
  28. 28. What about DataSets? 28 • Sophisticated SQL-inspired optimiser • Efficient Join Strategies • Managed Memory bypasses Garbage Collection • Fast, in-memory Iterative Bulk Computations
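
A small DataSet sketch where the optimiser is left to pick the physical join strategy; the data sets and key indices are illustrative.

    import org.apache.flink.api.scala._

    object BatchJoinSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        val users  = env.fromElements((1, "alice"), (2, "bob"))
        val visits = env.fromElements((1, "page-a"), (1, "page-b"), (2, "page-c"))

        // The optimiser chooses the join strategy (e.g. broadcast vs. repartition),
        // and intermediate data lives in Flink-managed memory rather than on the JVM heap.
        val joined = users.join(visits).where(0).equalTo(0)

        joined.print()
      }
    }
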
  29. 29. Some Interesting Libraries 29
  30. 30. Detecting Patterns 30 PatternStream<Event> tsunamiPattern = CEP.pattern(sensorStream, Pattern .begin("seismic").where(evt -> evt.motion.equals("ClassB")) .next("tidal").where(evt -> evt.elevation > 500)); DataStream<Alert> result = tsunamiPattern.select( pattern -> { return getEvacuationAlert(pattern); }); CEP Java library example. Scala DSL coming soon.
  31. 31. Mining Graphs with Gelly 31 • Iterative Graph Processing • Scatter-Gather • Gather-Sum-Apply • Graph Transformations/Properties • Library Methods: Community Detection, Label Propagation, Connected Components, PageRank, Shortest Paths, Triangle Count etc… Coming Soon: Real-time graph stream support
  32. 32. Machine Learning Pipelines 32 • Scikit-learn inspired pipelining • Supervised: SVM, Linear Regression • Preprocessing: Polynomial Features, Scalers • Recommendation: ALS
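
A sketch of how such a pipeline is chained, loosely following the FlinkML documentation of that period; the tiny training set and the choice of scaler and regressor are illustrative, and the exact FlinkML APIs may differ by version.

    import org.apache.flink.api.scala._
    import org.apache.flink.ml.common.LabeledVector
    import org.apache.flink.ml.math.DenseVector
    import org.apache.flink.ml.preprocessing.StandardScaler
    import org.apache.flink.ml.regression.MultipleLinearRegression

    object MLPipelineSketch {
      def main(args: Array[String]): Unit = {
        val env = ExecutionEnvironment.getExecutionEnvironment

        // Minimal, made-up training data
        val training: DataSet[LabeledVector] = env.fromElements(
          LabeledVector(1.0, DenseVector(1.0, 2.0)),
          LabeledVector(2.0, DenseVector(2.0, 4.0)))

        // scikit-learn style chaining: a transformer followed by a predictor
        val pipeline = StandardScaler().chainPredictor(MultipleLinearRegression())
        pipeline.fit(training)
      }
    }
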
  33. 33. Relational Queries 33 Table table = tableEnv.fromDataSet(input); Table filtered = table .groupBy("word") .select("word.count as count, word") .filter("count = 2"); DataSet<WC> result = tableEnv.toDataSet(filtered, WC.class); Table API example. SQL and Stream SQL coming soon.
  34. 34. Real-Time Monitoring 34 …for real-time processing
  35. 35. Coming Soon 35 • SQL and Stream SQL • Stream ML • Stream Graph Processing (Gelly-Stream) • Autoscaling • Incremental Snapshots
