Data intensive applications with Apache Flink - Simone Robutti, Radicalbit

"Data intensive applications with Apache Flink" by Simone Robutti, Machine Learning Engineer @ Radicalbit

In the last 10 years, the IT industry has seen a complete revolution in the perceived value of computing for businesses and in how engineers think about applications: in several application domains, the need for data has outgrown the capacity of commodity hardware, and the need for information has outpaced traditional processing technologies and approaches. In this talk we'll introduce Apache Flink, a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. It is an open-source project that builds on proven approaches as well as innovative algorithms. We will go in depth on how this tool can be used to implement data-intensive applications, focusing on the tools available today and the future prospects for using machine learning algorithms in a distributed context.

Simone Robutti, 27, is a Machine Learning Engineer at Radicalbit. He earned a Master's degree at Università degli studi di Milano with a thesis on SVM for noisy labeled datasets. Since then his interests have shifted towards the engineering side of Machine Learning and Big Data: implementation, deployment, portability and maintainability of ML-intensive systems. His current focus at Radicalbit is Flink and its Machine Learning library, FlinkML.

Published in: Data & Analytics

  1. Milan, July 13 2016. Data Intensive Applications with Apache Flink. Simone Robutti, Machine Learning Engineer at Radicalbit, @SimoneRobutti
  2. Agenda
     1. Brief Introduction to Apache Flink
        ○ Why
        ○ What
        ○ How
     2. Machine Learning on Flink
        ○ Present landscape
        ○ Future of the Ecosystem
     3. Closing notes on Radicalbit (shameless plug ahead)
  3. 100% Buzzword-free guaranteed: Big Data, Machine Intelligence, Web-scale, 400x, "It's like the human brain", Exactly-once, Exactly-once
  4. Why Flink (and not Spark/Storm/Samza...)? Because it's a production-ready, streaming-first, low-latency, fault-tolerant, high-throughput processing engine
  5. Flink: what is it? (Figure from Flink's documentation)
  6. Connectors and integrations
  7. Flink's Runtime (Figure from Flink's documentation)
  8. Flink's DataFlow (Figure from Flink's documentation)
     ○ Written by the user through the DataSet/DataStream API
     ○ Compiled and optimized in the client
  9. Flink's DataFlow (Figure from Flink's documentation)
     ○ The compiled job is translated to distributed tasks by the master and executed by workers
  10. Machine Learning on Flink
  11. ML on Flink
     ○ Ready and awesome for parallel ML
     ○ Work in progress for distributed ML
  12. Flink for Model Evaluation Pipelines
     ○ Composable, modular Flink Operator
     ○ Diagram labels: Source, Data Preparation, Evaluation, Sink, Source, Post-processing
  13. Evaluation with Flink-JPMML
     ○ Small library that implements basic model evaluation
     ○ Diagram labels: Source Operator, Flink-JPMML Operator, Sink Operator, Source Operator, model.pmml, Data Preparation
  14. ML on Flink: "I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop." (Yann LeCun)
  15. FlinkML
     ○ What: out-of-the-box workhorse algorithms (ALS, SVM, LinReg, LogReg …)
     ○ Status: early phase, slow development
  16. FlinkML
     ○ Pros: available out of the box, written with the Flink API
     ○ Cons: reinvents the wheel, only a few algorithms, no model persistence
  17. Samsara
     ○ What: linear algebra framework
     ○ Status: mature
  18. Samsara
     ○ Pros: generic algorithms with platform-specific bindings, skilled community
     ○ Cons: covers only a few use cases
  19. SAMOA
     ○ What: online learning algorithm framework (VHT, AMR, …)
     ○ Status: early phase, complicated relationship with the industry
  20. SAMOA
     ○ Pros: many powerful generic online learning algorithms, backed by academics (MOA, Weka)
     ○ Cons: not production ready, academic focus
  21. ML on Flink: the future of the ecosystem
  22. Apache Beam: programming model for data processing pipelines
     ● Streaming first, batch as a bounded stream
     ● Layered API: What, Where, When, How
     ● Platform agnostic: same program, different runners
  23. Apache Beam: Runners
     ● Flink
     ● Spark (Partial)
     ● Google Cloud Dataflow
     ● Plain Java
     ● Gearpump (WIP)
     ● Apex (WIP)
  24. BeamML: a runner-agnostic ML library
  25. FlinkML Roadmap
     ● More algorithms!
     ● Evaluation framework
     ● Persistence/export
     ● Online Learning Framework
  26. Proteus: Online Learning Platform, based on Flink (Source: Proteus' website)
  27. The role of Radicalbit
  28. Contributions
     ● Cassandra Connector
     ● Scala API extensions
     ● FlinkML (Linear Algebra Framework, MinHash)
     ● Akka Connector
  29. Our vision: Flink can become the ideal choice to build real-time, decision-heavy applications with high data throughput. To achieve this:
     ● Ambitious applications (aim for real-time services)
     ● Reliable distributed online learning (Proteus?)
     ● A Pipelining Framework (experiment fast, increase testability and modularity)
  30. Q&A
  31. THANKS! Simone Robutti. Mail: simone.robutti@radicalbit.io, Medium: @simone.robutti, Twitter: @SimoneRobutti, @weareradicalbit
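
Slides 12 and 13 describe model evaluation as a pipeline of composable, modular operators (source, data preparation, evaluation, post-processing, sink). The idea can be sketched roughly in plain Python with generators. This is not the Flink or Flink-JPMML API: every name below (`source`, `prepare`, `make_evaluator`, `postprocess`, `compose`, `run_pipeline`) is hypothetical, the stage order is an assumption read off the slide labels, and the fixed linear model merely stands in for a model that would be loaded from a file such as model.pmml.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Operator = Callable[[Iterable[Record]], Iterator[Record]]


def source() -> Iterator[Record]:
    """Hypothetical source: emits raw records one at a time, like a stream."""
    for raw in ["1.0,2.0", "3.0,4.5", "bad,row", "0.5,0.25"]:
        yield {"raw": raw}


def prepare(records: Iterable[Record]) -> Iterator[Record]:
    """Data preparation: parse features and drop malformed records."""
    for record in records:
        try:
            features = [float(x) for x in str(record["raw"]).split(",")]
        except ValueError:
            continue  # skip rows that cannot be parsed
        yield {"features": features}


def make_evaluator(weights: List[float], bias: float) -> Operator:
    """Evaluation: apply a fixed linear model (stand-in for a loaded model)."""
    def evaluate(records: Iterable[Record]) -> Iterator[Record]:
        for record in records:
            score = sum(w * x for w, x in zip(weights, record["features"])) + bias
            yield {**record, "score": score}
    return evaluate


def postprocess(records: Iterable[Record]) -> Iterator[Record]:
    """Post-processing: turn the raw score into a label."""
    for record in records:
        label = "positive" if record["score"] >= 0 else "negative"
        yield {**record, "label": label}


def compose(*operators: Operator) -> Operator:
    """Chain operators so the output of each one feeds the next."""
    def pipeline(records: Iterable[Record]) -> Iterator[Record]:
        for operator in operators:
            records = operator(records)
        return iter(records)
    return pipeline


def run_pipeline() -> List[Record]:
    pipeline = compose(prepare, make_evaluator([1.0, -1.0], 0.0), postprocess)
    # Sink: here we simply collect the results into a list.
    return list(pipeline(source()))


if __name__ == "__main__":
    for result in run_pipeline():
        print(result)
```

Because each stage only consumes and produces an iterator of records, stages can be swapped, reordered or tested in isolation, which is the kind of modularity the slides attribute to Flink operators.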
