Research on stream and graph processing on Apache Flink, presented at the 2nd Bay Area Apache Flink meetup

Published in: Data & Analytics


  1. 1. Apache Flink Research: A look into the future. Paris Carbone, PhD Candidate, KTH Royal Institute of Technology
  2. 2. Timeline of streaming concepts and systems: ’88 Active DataBases; ’88 HiPac; ’95 Materialised Views; ’00 Eddies; ’01 Complex Event Processing; ’02 Aurora; ’03 TelegraphCQ; ’03 STREAM; ’05 Borealis; ’05 Decentralised Stream Queries; ’05 High Availability on Streaming; ’12 Policy-Based Windowing; ’12 Twitter Storm; ’12 IBM System S; ’13 Spark Streaming; ’13 Parallel Recovery; ’13 Google Millwheel; ’13 Discretized Streams; ’14 Apache Flink; ’15 Advanced Windowing (session, watermarks, user-defined); ’15 Google Dataflow
  3. 3. Research in Flink • Many ideas behind Flink were research products • Job plan optimiser • Efficient joins • Memory management • The execution engine was always a streaming engine
  4. 4. Our Focus • Contributions already in the current release • Streaming Semantics - Expressive Windowing • State Management (representation, handling) • Graph Semantics - Gelly • Exactly-once processing (checkpointing)
  5. 5. Ongoing Research • Advanced State Management & Fault Tolerance • Pre-aggregate sharing for sliding windows • Streaming ML Pipelines • Streaming Graphs • Experiment Reproducibility
  6. 6. Current Focus (diagram labels: Streaming API, Batch API, Flink Optimiser, Flink Runtime, Table, ML, Gelly, ML, Gelly, State Management)
  7. 7. Lessons Learned from Batch (diagram: batch-1, batch-2) • If a batch computation fails, simply repeat the computation as a transaction (if we have repeatable sources) • Transaction rate is constant • Can we apply these principles to a true streaming execution?
  8. 8. Distributed Snapshots (diagram: execution with snapshots at t1, t2, t3; reset from t2)
  9. 9. Taking Snapshots (diagram: execution with snapshots at t1, t2) Initial approach (see Naiad) • Pause execution on t1, t2, ... • Collect state • Restore execution
  10. 10. Asynchronous Snapshots (diagram: execution timeline with t1, t2; snap - t1 and snap - t2 are each labelled "snapshotting")
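The recovery idea on the snapshot slides (snapshot the state periodically; on failure, reset to the latest snapshot and replay a repeatable source) can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions: the class `SnapshotRecoveryDemo`, its method names, and the injected failure are hypothetical, it is not Flink's checkpointing API, and it models neither distribution nor the asynchronous part.

```java
import java.util.List;

/** Hypothetical single-operator simulation; not Flink's checkpointing API. */
class SnapshotRecoveryDemo {

    /** A snapshot pairs the operator state with the source offset it reflects. */
    static final class Snapshot {
        final long sum;
        final int offset;

        Snapshot(long sum, int offset) {
            this.sum = sum;
            this.offset = offset;
        }
    }

    /**
     * Sums a replayable source, taking a snapshot every `interval` records.
     * A failure is injected once, just before the record at index `failAt`
     * (pass -1 for no failure). Recovery resets the state to the latest
     * snapshot and replays the source from that snapshot's offset.
     */
    static long run(List<Integer> source, int interval, int failAt) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        Snapshot latest = new Snapshot(0L, 0);
        long sum = 0L;
        int offset = 0;
        boolean failed = false;
        while (offset < source.size()) {
            if (!failed && offset == failAt) {
                failed = true;
                sum = latest.sum;        // discard state newer than the snapshot
                offset = latest.offset;  // rewind the repeatable source
                continue;
            }
            sum += source.get(offset);
            offset++;
            if (offset % interval == 0) {
                latest = new Snapshot(sum, offset);
            }
        }
        return sum;
    }
}
```

Because the snapshot stores the state together with the source offset, the replay after a failure counts each record exactly once in the final result, which is the property the "repeat computation as a transaction" bullet relies on.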
  11. 11. Sliding Window Optimisations (diagram labels: Managed Memory, window operator, merge tree, Windowing Policies)
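Pre-aggregate sharing for sliding windows can be illustrated with the simple "panes" technique: the stream is cut into non-overlapping panes as long as the slide, each pane is aggregated once, and every window merges the pane aggregates it covers. This is a minimal sketch, not the merge tree named on the slide and not Flink's window operator; `PaneSlidingSum` and its count-based windows are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of pane-based pre-aggregate sharing. */
class PaneSlidingSum {

    /**
     * Computes sums over count-based sliding windows of `size` elements that
     * advance by `slide` elements. Assumes `size` is a multiple of `slide`.
     * Each element is added to exactly one pane; windows reuse the pane sums.
     */
    static List<Long> slidingSums(int[] stream, int size, int slide) {
        if (slide <= 0 || size <= 0 || size % slide != 0) {
            throw new IllegalArgumentException("size must be a positive multiple of slide");
        }
        int panesPerWindow = size / slide;

        // Pre-aggregate: one partial sum per complete pane.
        int paneCount = stream.length / slide;
        long[] paneSums = new long[paneCount];
        for (int i = 0; i < paneCount * slide; i++) {
            paneSums[i / slide] += stream[i];
        }

        // Combine: each window merges `panesPerWindow` shared pane sums.
        List<Long> windows = new ArrayList<>();
        for (int last = panesPerWindow - 1; last < paneCount; last++) {
            long sum = 0L;
            for (int p = last - panesPerWindow + 1; p <= last; p++) {
                sum += paneSums[p];
            }
            windows.add(sum);
        }
        return windows;
    }
}
```

With `size = 4` and `slide = 2`, each element is touched once during pre-aggregation and each window needs only two additions, instead of re-summing four raw elements per window.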
  12. 12. ML Pipelines • Flink ML: training set, test set; ETL, Transformers, Learners, Evaluators • Flink Streaming ML: training stream, test stream; stream ETL, concept drift detection, anomaly detection, online learning, online classification
  13. 13. ML on Unbounded Data • We are often interested in: • Low latency approximations on a single pass • Instant classification on stream ingestion with higher error bounds • Continuous aggregates on unbounded data synopses (e.g. stream sampling)
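As one example of a stream-sampling synopsis, the sketch below uses reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample of an unbounded stream in a single pass. The slide does not say which sampling scheme is meant, so this choice and the class `ReservoirSample` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical helper illustrating reservoir sampling (Algorithm R). */
class ReservoirSample {
    private final int capacity;
    private final List<Integer> sample = new ArrayList<>();
    private final Random random;
    private long seen = 0;

    ReservoirSample(int capacity, long seed) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.random = new Random(seed);
    }

    /** Offers one stream element; memory use stays bounded by `capacity`. */
    void add(int item) {
        seen++;
        if (sample.size() < capacity) {
            sample.add(item);  // fill the reservoir first
        } else {
            // Draw a slot in [0, seen); the item is kept only if the slot
            // falls inside the reservoir, i.e. with probability capacity / seen.
            long slot = (long) (random.nextDouble() * seen);
            if (slot < capacity) {
                sample.set((int) slot, item);
            }
        }
    }

    /** Returns a copy of the current sample. */
    List<Integer> snapshot() {
        return new ArrayList<>(sample);
    }

    long seenCount() {
        return seen;
    }
}
```

Continuous aggregates can then be computed over `snapshot()` at any time, trading exactness for bounded memory.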
  14. 14. Streaming ML • Batch API: bounded data, multi-pass algorithms, bulk classification • Streaming API: unbounded data, single-pass algorithms, instant classification (diagram labels: Table, ML, Gelly, ML, Gelly)
  15. 15. ML Use Cases • Batch ML: SVM; Clustering; Col. Filtering (matrix factorisation); Rank Estimation; Similarity Matching • Streaming ML: Anomaly Detection; Concept Drift Detection; Incremental Clustering; Dec. Tree and Rule Mining; Approximations (freq itemsets, distinct items, samples etc.)
  16. 16. Stream ML Abstractions • Reusing the same abstractions from the batch ML library (e.g. Transformer, Learner, Evaluator) • plus some more abstractions (e.g. Drift Detector)
  17. 17. Example: Vertical Hoeffding Trees • Building a decision tree on-the-fly • Parallelizing attribute metric computation (vertical parallelization)
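Hoeffding trees decide when a leaf has seen enough examples to split by using the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The sketch below assumes the standard split test from the Hoeffding tree literature (split when the gap between the two best attributes exceeds epsilon); the class and method names are hypothetical, and the vertical parallelisation of the attribute metrics is not shown.

```java
/** Hypothetical helper for the standard Hoeffding-tree split test. */
class HoeffdingBound {

    /**
     * Hoeffding bound for a metric with value range `range`, after `n`
     * observations, holding with probability 1 - `delta`.
     */
    static double epsilon(double range, double delta, long n) {
        if (range <= 0 || delta <= 0 || delta >= 1 || n <= 0) {
            throw new IllegalArgumentException("need range > 0, 0 < delta < 1, n > 0");
        }
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Split once the best attribute beats the runner-up by more than epsilon. */
    static boolean shouldSplit(double bestGain, double secondBestGain, double epsilon) {
        return bestGain - secondBestGain > epsilon;
    }
}
```

The bound shrinks as more examples arrive, so a leaf that cannot yet separate two close attributes simply waits for more data instead of revisiting old examples.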
  18. 18. VHT Pipeline Definition (diagram labels: input, VHT Learner, DataPoints, Prequential Evaluator, Instance Classification)
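A prequential evaluator scores a learner in "test-then-train" fashion: each arriving instance is first used to test the current model and only then to train it. The sketch below shows that loop with a deliberately trivial majority-class learner standing in for the VHT learner; the `OnlineLearner` interface and all names are hypothetical and are not the Flink ML abstractions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of prequential (test-then-train) evaluation. */
class PrequentialDemo {

    interface OnlineLearner {
        int predict(double[] features);
        void train(double[] features, int label);
    }

    /** Predicts the most frequent label seen so far (-1 before any training). */
    static final class MajorityClassLearner implements OnlineLearner {
        private final Map<Integer, Integer> counts = new HashMap<>();

        @Override
        public int predict(double[] features) {
            int best = -1;
            int bestCount = 0;
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                if (e.getValue() > bestCount) {
                    best = e.getKey();
                    bestCount = e.getValue();
                }
            }
            return best;
        }

        @Override
        public void train(double[] features, int label) {
            counts.merge(label, 1, Integer::sum);
        }
    }

    /** Returns the fraction of instances predicted correctly before training on them. */
    static double prequentialAccuracy(OnlineLearner learner, double[][] xs, int[] ys) {
        int correct = 0;
        for (int i = 0; i < xs.length; i++) {
            if (learner.predict(xs[i]) == ys[i]) {  // test first
                correct++;
            }
            learner.train(xs[i], ys[i]);            // then train
        }
        return xs.length == 0 ? 0.0 : (double) correct / xs.length;
    }
}
```

Labels are assumed to be non-negative so that -1 can mean "no prediction yet". No separate test set is needed, which is why this scheme suits unbounded streams.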
  19. 19. Modelling complex pipelines (diagram labels: Transformer, Learner, Evaluator; change, reset, error)
  20. 20. Or even more complex pipelines: Integrating Batch and Streaming ML (diagram labels: Transformer, Learner, Evaluator, Batch ML Pipeline; change, error, correct, schedule)
  21. 21. Unbounded Graph Analysis • Graphs are often created by a snapshot of a stream of events: user interactions, product purchases, clicks, etc. • Can we process the graph as a stream, immediately when it arrives in the system? • We can leverage existing research on one-pass streaming algorithms and Flink’s streaming engine
  22. 22. Streaming Graphs? • Batch API: bounded graph data, multi-pass algorithms (BSP), exact computations • Streaming API: unbounded graph data, single-pass algorithms, incremental computations (diagram labels: Table, ML, Gelly, ML, Gelly)
  23. 23. Graph Use Cases • Batch Multi-Pass (BSP): Graph Traversal; Rank Estimation; Connected Components; Shortest Paths • Streaming Single-Pass: Degree Estimation; Property Check (Bipartiteness/Connectivity); Max Cardinality Matching; Triangle Count
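A connectivity check is a convenient example of a single-pass graph computation: a union-find structure can absorb edges one at a time and answer connectivity queries at any point, without storing the edges. The sketch below assumes an insert-only edge stream and integer vertex ids; `StreamingComponents` is a hypothetical class, not part of Gelly or of the API previewed in this deck.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-pass connectivity sketch over an insert-only edge stream. */
class StreamingComponents {
    private final Map<Integer, Integer> parent = new HashMap<>();
    private int components = 0;

    /** Finds the representative of v, registering v if it is new. */
    private int find(int v) {
        if (!parent.containsKey(v)) {
            parent.put(v, v);
            components++;
            return v;
        }
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        // Path compression: point every vertex on the path at the root.
        while (parent.get(v) != root) {
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    /** Consumes one edge from the stream; the edge itself is not stored. */
    void addEdge(int u, int v) {
        int ru = find(u);
        int rv = find(v);
        if (ru != rv) {
            parent.put(ru, rv);
            components--;
        }
    }

    /** True if both vertices have been seen and lie in the same component. */
    boolean connected(int u, int v) {
        return parent.containsKey(u) && parent.containsKey(v) && find(u) == find(v);
    }

    /** Number of components among the vertices seen so far. */
    int componentCount() {
        return components;
    }
}
```

State grows with the number of vertices rather than the number of edges, which is what makes the computation feasible on an unbounded edge stream.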
  24. 24. API Preview
  25. 25. Example: Top-k Influential Users
      DataStream<UserID> topUsers =
          GraphStream.fromDataStream(new TwitterSource())         // extracts user data and hashtags from the tweet
              .filterOnVertices(new FilterUserByFollowers(1000))  // filter out users with <1000 followers
              .filterOnEdges(new FilterHashTag("#graphs"))        // filter out edges with irrelevant hashtags
              .outDegrees().topK(1, 10);                          // the out-degree will be the number of relevant tweets
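The logic of the top-k example can be mirrored with plain Java collections to make each step concrete. This is only an illustration under stated assumptions: a bounded list stands in for the tweet stream, the `Tweet` class and `topUsers` method are hypothetical, and nothing here reproduces the previewed `GraphStream` API or the exact meaning of its `topK` arguments.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical plain-Java mirror of the top-k influential users example. */
class TopKInfluencers {

    static final class Tweet {
        final String user;
        final int followers;
        final String hashtag;

        Tweet(String user, int followers, String hashtag) {
            this.user = user;
            this.followers = followers;
            this.hashtag = hashtag;
        }
    }

    /** Returns up to k users ranked by their number of relevant tweets (out-degree). */
    static List<String> topUsers(List<Tweet> tweets, int minFollowers, String hashtag, int k) {
        Map<String, Integer> outDegree = new HashMap<>();
        for (Tweet t : tweets) {
            if (t.followers < minFollowers) {
                continue;  // vertex filter: drop users with too few followers
            }
            if (!hashtag.equals(t.hashtag)) {
                continue;  // edge filter: drop tweets with irrelevant hashtags
            }
            outDegree.merge(t.user, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(outDegree.entrySet());
        // Highest out-degree first; ties broken by user name for determinism.
        entries.sort((a, b) -> {
            int byDegree = Integer.compare(b.getValue(), a.getValue());
            return byDegree != 0 ? byDegree : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

In a streaming setting the `outDegree` map would be updated incrementally per tweet and the ranking emitted continuously; the batch loop here only shows what is being counted.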
  26. 26. Experiments - Reproducibility • Defining, deploying, orchestrating and collecting results for experiments is a big hassle! • A single experiment will need • devops hours to allocate VMs, fetch the right versions and install system dependencies in the correct order • dev hours to write scripts for data processing/collection • Repeating a benchmark/experiment is impossible without all the low-level configuration details
  27. 27. Introducing Karamel (diagram labels: standalone web app, karamel file, karamelized cookbooks, yaml) • Simplifying system dependencies to a bare minimum • Simple integration for existing cookbooks (Chef) by adding a Karamel file • Compositional cluster definitions • Tight integration with GitHub
  29. 29. Flink in Karamel • Demo