Milan – July 13 2016
Data Intensive Applications with Apache Flink
Simone Robutti
Machine Learning Engineer at Radicalbit
...
Agenda
1. Brief Introduction to Apache Flink
○ Why
○ What
○ How
2. Machine Learning on Flink
○ Present landscape
○ Future ...
100% Buzzword-free guaranteed
Big Data
Machine
Intelligence
Web-scale
400x
It’s like the human brain
Exactly-once
Exactly-...
Why Flink (and not Spark/Storm/Samza...)
Because it’s
production-ready
streaming-first
low-latency
fault-tolerant
high-thr...
Flink: what is it?
From Flink’s Documentation
Connectors and integrations
Flink’s Runtime
From Flink’s Documentation
Flink’s DataFlow
From Flink’s Documentation
Written by the user through DataSet/DataStream API
Compiled and optimized in t...
Flink’s DataFlow
From Flink’s Documentation
The compiled job is translated to distributed tasks by
the master and executed...
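
The two DataFlow slides describe a program that is first declared through an API, then compiled, and finally split into parallel tasks. The following is a minimal plain-Python sketch of that idea under stated assumptions, not Flink's API: `Plan`, `map`, `filter`, and `execute` are hypothetical names, and the "workers" are simulated by partitioning a list in a single process.

```python
from typing import Callable, List, Tuple


class Plan:
    """Hypothetical lazy dataflow plan: operators are recorded, not run."""

    def __init__(self, source: List[int]) -> None:
        self.source = source
        self.ops: List[Tuple[str, Callable]] = []

    def map(self, fn: Callable) -> "Plan":
        self.ops.append(("map", fn))
        return self

    def filter(self, fn: Callable) -> "Plan":
        self.ops.append(("filter", fn))
        return self

    def execute(self, parallelism: int) -> List[int]:
        # "Master" step: split the input into one partition per worker.
        partitions = [self.source[i::parallelism] for i in range(parallelism)]
        results: List[int] = []
        # "Worker" step: every partition runs the same chain of operators.
        for part in partitions:
            for kind, fn in self.ops:
                if kind == "map":
                    part = [fn(x) for x in part]
                else:
                    part = [x for x in part if fn(x)]
            results.extend(part)
        return sorted(results)


# Declaring the plan runs nothing; work happens only in execute().
plan = Plan(list(range(10))).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
squares = plan.execute(parallelism=3)
```

Separating the declaration from the execution is what gives a real engine room to optimize the plan before any data moves; the sketch only records the operators and skips the optimization step.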
Machine Learning on Flink
Ready and awesome for parallel ML
Work in progress for distributed ML
ML on Flink
Flink for Model Evaluation Pipelines
[Diagram labels: Source (twice), Data Preparation, Evaluation, Postprocessing, Sink]
Composable, modular ...
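
The stages named on this slide can be pictured as small functions chained together. What follows is a hedged plain-Python sketch of that composition, not Flink code: the record layout, the `linear_model` stand-in for a trained model, and the helper names `prepare`, `evaluate`, `postprocess`, and `run_pipeline` are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]
Stage = Callable[[Iterable[Record]], Iterator[Record]]


def prepare(records: Iterable[Record]) -> Iterator[Record]:
    # Data preparation: rescale the hypothetical "x" feature.
    for r in records:
        yield {"x": r["x"] / 10.0}


def evaluate(model: Callable[[float], float]) -> Stage:
    # Evaluation: apply an already-trained model to each record.
    def stage(records: Iterable[Record]) -> Iterator[Record]:
        for r in records:
            yield {**r, "score": model(r["x"])}
    return stage


def postprocess(records: Iterable[Record]) -> Iterator[Record]:
    # Postprocessing: turn the raw score into a label.
    for r in records:
        yield {**r, "label": "high" if r["score"] > 1.0 else "low"}


def run_pipeline(source: Iterable[Record], stages: List[Stage]) -> List[Record]:
    stream: Iterable[Record] = source
    for stage in stages:
        stream = stage(stream)
    return list(stream)  # the list plays the role of the sink


def linear_model(x: float) -> float:
    # Stand-in for a model trained elsewhere (assumption).
    return 2.0 * x + 0.5


source = [{"x": 1.0}, {"x": 5.0}]
output = run_pipeline(source, [prepare, evaluate(linear_model), postprocess])
```

Because every stage has the same shape (records in, records out), stages can be reordered, swapped, or tested in isolation, which is the property the slide calls composable and modular.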
Evaluation with Flink-JPMML
[Diagram labels: Source Operator (twice), Flink-JPMML Operator, Sink Operator, model.pmml]
Small library ...
“I have seen people insisting on using Hadoop for
datasets that could easily fit on a flash drive and could
easily be proc...
FlinkML
What: Out-of-the-box workhorse algorithms (ALS,
SVM, LinReg, LogReg …)
Status: early phase, slow development
FlinkML
Pro: available out of the box, written with Flink API
Cons: reinvents the wheel, only a few algorithms,
no model p...
Samsara
What: Linear algebra framework
Status: mature
Samsara
Pro: generic algorithms with platform-specific
bindings, skilled community
Cons: covers only a few use cases
SAMOA
What: Online learning algorithm framework (VHT,
AMR, …)
Status: early phase, complicated relationship with
the indus...
SAMOA
Pro: many powerful generic online learning
algorithms, backed by academics (MOA, Weka)
Cons: not production ready, a...
ML on Flink: the future of the ecosystem
Apache Beam
Programming model for data processing pipelines
● Streaming first, batch as a bounded stream
● Layered API: Wh...
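
The "batch as a bounded stream" point can be illustrated with one function that handles both cases. This is a plain-Python sketch and not Beam's API: `windowed_sums` is a hypothetical helper, and the count-based windows are a simplification chosen for brevity.

```python
import itertools
from typing import Iterable, Iterator, List


def windowed_sums(events: Iterable[int], size: int) -> Iterator[int]:
    """Sum events in consecutive count-based windows of `size` elements."""
    window: List[int] = []
    for e in events:
        window.append(e)
        if len(window) == size:
            yield sum(window)
            window = []
    if window:  # only reached when the input is bounded
        yield sum(window)


# Batch: a bounded input, consumed to the end.
batch_result = list(windowed_sums([1, 2, 3, 4, 5], size=2))

# Streaming: an unbounded input; take only the first three windows.
unbounded = itertools.count(1)
stream_result = list(itertools.islice(windowed_sums(unbounded, size=2), 3))
```

The same logic serves both inputs; the only difference is whether the input ever ends, which is when the final partial window gets flushed.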
Apache Beam - Runners
● Flink
● Spark (Partial)
● Google Cloud Dataflow
● Plain Java
● Gearpump (WIP)
● Apex (WIP)
BeamML: a runner-agnostic ML library
FlinkML Roadmap
● More algorithms!
● Evaluation framework
● Persistence/export
● Online Learning Framework
Proteus
Online Learning Platform - based on Flink
Source: Proteus’ website
The role of Radicalbit
Contributions
● Cassandra Connector
● Scala API extensions
● FlinkML (Linear Algebra Framework, MinHash)
● Akka Connector
Our vision
Flink can become the ideal choice to build real-time decision-heavy applications with high data-throughput
To ...
Q&A
THANKS!
Simone Robutti
Mail: simone.robutti@radicalbit.io Medium: @simone.robutti
Twitter: @SimoneRobutti — @weareradicalb...

A brief introduction to Apache Flink and an overview of the current possibilities it offers to develop Machine Learning solutions.

Data Intensive Applications with Apache Flink

  1. Milan – July 13 2016 Data Intensive Applications with Apache Flink Simone Robutti Machine Learning Engineer at Radicalbit @SimoneRobutti
  2. Agenda 1. Brief Introduction to Apache Flink ○ Why ○ What ○ How 2. Machine Learning on Flink ○ Present landscape ○ Future of the Ecosystem 3. Closing notes on Radicalbit (shameless plug ahead)
  3. 100% Buzzword-free guaranteed Big Data Machine Intelligence Web-scale 400x It’s like the human brain Exactly-once Exactly-once
  4. Why Flink (and not Spark/Storm/Samza...) Because it’s production-ready streaming-first low-latency fault-tolerant high-throughput processing engine
  5. Flink: what is it? From Flink’s Documentation
  6. Connectors and integrations
  7. Flink’s Runtime From Flink’s Documentation
  8. Flink’s DataFlow From Flink’s Documentation Written by the user through DataSet/DataStream API Compiled and optimized in the client
  9. Flink’s DataFlow From Flink’s Documentation The compiled job is translated to distributed tasks by the master and executed by workers
  10. Machine Learning on Flink
  11. Ready and awesome for parallel ML Work in progress for distributed ML ML on Flink
  12. Flink for Model Evaluation Pipelines Source Data Preparation Evaluation Sink Source Postprocessing Composable, modular Flink Operator
  13. Evaluation with Flink-JPMML Source Operator Flink-JPMML Operator Sink Operator Source Operator model.pmml Small library that implements basic model eval. Data Preparation
  14. “I have seen people insisting on using Hadoop for datasets that could easily fit on a flash drive and could easily be processed on a laptop.” - Yann LeCun - ML on Flink
  15. FlinkML What: Out-of-the-box workhorse algorithms (ALS, SVM, LinReg, LogReg …) Status: early phase, slow development
  16. FlinkML Pro: available out of the box, written with Flink API Cons: reinvents the wheel, only a few algorithms, no model persistence
  17. Samsara What: Linear algebra framework Status: mature
  18. Samsara Pro: generic algorithms with platform-specific bindings, skilled community Cons: covers only a few use cases
  19. SAMOA What: Online learning algorithm framework (VHT, AMR, …) Status: early phase, complicated relationship with the industry
  20. SAMOA Pro: many powerful generic online learning algorithms, backed by academics (MOA, Weka) Cons: not production ready, academic focus
  21. ML on Flink: the future of the ecosystem
  22. Apache Beam Programming model for data processing pipelines ● Streaming first, batch as a bounded stream ● Layered API: What, Where, When, How ● Platform agnostic: same program, different runners
  23. Apache Beam - Runners ● Flink ● Spark (Partial) ● Google Cloud Dataflow ● Plain Java ● Gearpump (WIP) ● Apex (WIP)
  24. BeamML: a runner-agnostic ML library
  25. FlinkML Roadmap ● More algorithms! ● Evaluation framework ● Persistence/export ● Online Learning Framework
  26. Proteus Online Learning Platform - based on Flink Source: Proteus’ website
  27. The role of Radicalbit
  28. Contributions ● Cassandra Connector ● Scala API extensions ● FlinkML (Linear Algebra Framework, MinHash) ● Akka Connector
  29. Our vision Flink can become the ideal choice to build real-time decision-heavy applications with high data-throughput To achieve this: ● Ambitious applications (aim for real-time services) ● Reliable distributed online learning (Proteus?) ● A Pipelining Framework (experiment fast, increase testability and modularity)
  30. Q&A
  31. THANKS! Simone Robutti Mail: simone.robutti@radicalbit.io Medium: @simone.robutti Twitter: @SimoneRobutti, @weareradicalbit
