
Data Intensive Applications with Apache Flink



A brief introduction to Apache Flink and an overview of the current possibilities it offers to develop Machine Learning solutions.

Published in: Software


  1. Milan, July 13 2016. Data Intensive Applications with Apache Flink. Simone Robutti, Machine Learning Engineer at Radicalbit, @SimoneRobutti
  2. Agenda: 1. Brief Introduction to Apache Flink (Why, What, How); 2. Machine Learning on Flink (Present landscape, Future of the Ecosystem); 3. Closing notes on Radicalbit (shameless plug ahead)
  3. 100% Buzzword-free guaranteed: Big Data, Machine Intelligence, Web-scale, 400x, It’s like the human brain, Exactly-once, Exactly-once
  4. Why Flink (and not Spark/Storm/Samza...)? Because it’s a production-ready, streaming-first, low-latency, fault-tolerant, high-throughput processing engine.
  5. Flink: what is it? Source: Flink’s documentation
  6. Connectors and integrations
  7. Flink’s Runtime. Source: Flink’s documentation
  8. Flink’s DataFlow. Source: Flink’s documentation. The dataflow is written by the user through the DataSet/DataStream API, then compiled and optimized in the client.
  9. Flink’s DataFlow. Source: Flink’s documentation. The compiled job is translated to distributed tasks by the master and executed by workers.
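
The three steps on the two DataFlow slides (declare the program, compile it in the client, split it into tasks run by workers) can be illustrated with a toy model in plain Python. This is only a sketch of the idea: `Plan`, `compile_plan` and `execute` are hypothetical names, not Flink APIs, and the "workers" are simulated sequentially.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Toy model of a lazily declared dataflow. None of these names are Flink APIs.


@dataclass
class Plan:
    source: List[int]
    ops: List[Tuple[str, Callable]] = field(default_factory=list)

    def map(self, fn):
        # Declaring an operator only extends the plan; nothing runs yet.
        return Plan(self.source, self.ops + [("map", fn)])

    def filter(self, pred):
        return Plan(self.source, self.ops + [("filter", pred)])


def compile_plan(plan):
    """'Client' step: fuse the operator chain into one per-record task."""
    def task(records):
        for record in records:
            keep = True
            for kind, fn in plan.ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record
    return task


def execute(plan, parallelism=2):
    """'Master' step: partition the input and hand one partition to each worker."""
    task = compile_plan(plan)
    partitions = [plan.source[i::parallelism] for i in range(parallelism)]
    # Workers are simulated one after another here.
    return [list(task(partition)) for partition in partitions]


plan = Plan(list(range(10))).map(lambda x: x * x).filter(lambda x: x % 3 == 0)
result = execute(plan, parallelism=2)
print(result)
```

The point of the sketch is the separation of concerns: the user code only describes the computation, and the decisions about fusing operators and distributing work are made later, outside the user program.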
  10. Machine Learning on Flink
  11. ML on Flink: ready and awesome for parallel ML; work in progress for distributed ML.
  12. Flink for Model Evaluation Pipelines. Diagram boxes: Source, Data Preparation, Evaluation, Sink, Source, Post-processing. Annotation: composable, modular Flink operator.
  13. Evaluation with Flink-JPMML: a small library that implements basic model evaluation. Diagram boxes: Source Operator, Flink-JPMML Operator, Sink Operator, Source Operator, model.pmml, Data Preparation.
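
The idea of a model evaluation pipeline built from composable, modular operators can be sketched in plain Python with generator functions. This is an illustration under assumptions, not the Flink or Flink-JPMML API: the stage names, the input records and the linear scoring function standing in for a model loaded from a PMML file are all hypothetical.

```python
from functools import reduce
from typing import Dict, Iterator, List

Record = Dict[str, float]


def source() -> Iterator[Record]:
    # Hypothetical input records.
    yield {"x1": 1.0, "x2": 2.0}
    yield {"x1": -3.0, "x2": 0.5}


def prepare(records):
    # Data preparation: build the feature vector the model expects.
    for record in records:
        yield [record["x1"], record["x2"]]


def make_evaluator(weights, bias):
    # Stand-in for a loaded model (for example one read from a PMML file).
    def evaluate(vectors):
        for vector in vectors:
            yield sum(w * x for w, x in zip(weights, vector)) + bias
    return evaluate


def postprocess(scores):
    # Post-processing: turn raw scores into labels.
    for score in scores:
        yield "positive" if score > 0 else "negative"


def pipeline(*stages):
    # Compose stages left to right; each stage consumes the previous one's output.
    return lambda stream: reduce(lambda acc, stage: stage(acc), stages, stream)


run = pipeline(prepare, make_evaluator([0.5, 1.0], -1.0), postprocess)
sink: List[str] = list(run(source()))
print(sink)
```

Because every stage has the same shape (a stream in, a stream out), stages can be swapped, reordered or tested in isolation, which is the property the slides attribute to Flink operators.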
  14. ML on Flink: “I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.” (Yann LeCun)
  15. FlinkML. What: out-of-the-box workhorse algorithms (ALS, SVM, LinReg, LogReg, …). Status: early phase, slow development.
  16. FlinkML. Pros: available out of the box, written with the Flink API. Cons: reinvents the wheel, only a few algorithms, no model persistence.
  17. Samsara. What: linear algebra framework. Status: mature.
  18. Samsara. Pros: generic algorithms with platform-specific bindings, skilled community. Cons: covers only a few use cases.
  19. SAMOA. What: online learning algorithm framework (VHT, AMR, …). Status: early phase, complicated relationship with the industry.
  20. SAMOA. Pros: many powerful generic online learning algorithms, backed by academics (MOA, Weka). Cons: not production ready, academic focus.
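
For readers unfamiliar with the term, "online learning" means the model is updated one record at a time as the stream arrives, rather than trained on a stored dataset. The sketch below shows the general pattern with a one-dimensional linear model trained by stochastic gradient descent and a test-then-train loop. It is not a SAMOA algorithm (VHT and AMR are considerably more involved); the class, the learning rate and the synthetic stream are assumptions made for illustration.

```python
class OnlineLinearModel:
    """1-D linear model y ~ w * x, updated one example at a time with SGD."""

    def __init__(self, learning_rate=0.05):
        self.w = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.w * x

    def update(self, x, y):
        # One gradient step on the squared error of a single example.
        error = y - self.predict(x)
        self.w += self.learning_rate * error * x


def stream(n):
    # Hypothetical stream of (x, y) pairs generated by y = 2x.
    for i in range(n):
        x = float(i % 3 + 1)
        yield x, 2.0 * x


model = OnlineLinearModel()
abs_errors = []
for x, y in stream(150):
    abs_errors.append(abs(y - model.predict(x)))  # test on the new record first
    model.update(x, y)                            # then train on it

print(model.w, abs_errors[0], abs_errors[-1])
```

Each record is seen once and then discarded, so memory use stays constant no matter how long the stream runs.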
  21. ML on Flink: the future of the ecosystem
  22. Apache Beam: a programming model for data processing pipelines. Streaming first, batch as a bounded stream; layered API (What, Where, When, How); platform agnostic (same program, different runners).
  23. Apache Beam runners: Flink; Spark (partial); Google Cloud Dataflow; plain Java; Gearpump (WIP); Apex (WIP).
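
Two of the Beam ideas above, "batch as a bounded stream" and "same program, different runners", can be illustrated without the Beam SDK. In the plain-Python sketch below the program is written once against an iterable and then handed to two hypothetical runners; `word_lengths`, `batch_runner` and `streaming_runner` are illustrative names and not part of Apache Beam.

```python
import itertools
from typing import Iterable, Iterator, List


def word_lengths(lines: Iterable[str]) -> Iterator[int]:
    """The 'program': written once, unaware of whether its input ever ends."""
    for line in lines:
        for word in line.split():
            yield len(word)


def batch_runner(program, data: List[str]) -> List[int]:
    # Batch: the whole bounded input is available up front.
    return list(program(data))


def streaming_runner(program, stream: Iterator[str], take: int) -> List[int]:
    # Streaming: the input may never end, so results are consumed incrementally.
    return list(itertools.islice(program(stream), take))


bounded = ["exactly once", "low latency"]
unbounded = itertools.cycle(bounded)  # never terminates on its own

batch_out = batch_runner(word_lengths, bounded)
stream_out = streaming_runner(word_lengths, unbounded, take=6)
print(batch_out, stream_out)
```

The batch result is simply the prefix of the streaming result that corresponds to the bounded input, which is the sense in which a batch job is a stream that happens to end.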
  24. BeamML: a runner-agnostic ML library
  25. FlinkML Roadmap: more algorithms!; evaluation framework; persistence/export; online learning framework.
  26. Proteus: online learning platform based on Flink. Source: Proteus’ website
  27. The role of Radicalbit
  28. Contributions: Cassandra Connector; Scala API extensions; FlinkML (Linear Algebra Framework, MinHash); Akka Connector.
  29. Our vision: Flink can become the ideal choice to build real-time, decision-heavy applications with high data throughput. To achieve this: ambitious applications (aim for real-time services); reliable distributed online learning (Proteus?); a pipelining framework (experiment fast, increase testability and modularity).
  30. Q&A
  31. THANKS! Simone Robutti. Mail: ; Medium: @simone.robutti; Twitter: @SimoneRobutti, @weareradicalbit