Seif Haridi KTH/RISE
AI @ RISE
Hopsworks, Apache Flink and Beyond
Big Data Analytics Platforms
By KTH and RISE SICS
Hopsworks: End2End Data Platform for Analytics/ML
[Diagram: Hopsworks platform architecture]
Datasources, Applications, API, Dashboards
Stages (Orchestration in Airflow): Data Preparation & Ingestion, Experimentation & Model Training, Deploy & Productionalize
Batch and Streaming: Apache Beam, Apache Spark, Apache Flink
Pip, Conda, TensorFlow, scikit-learn, PyTorch, Jupyter Notebooks, TensorBoard
Kubernetes, Distributed ML & DL, Model Serving, Model Monitoring, Kafka + Spark Streaming
Hopsworks Feature Store
Filesystem and Metadata storage: HopsFS
Apache Kafka
Logical Clocks was founded by the team that created and continues to drive Hopsworks, a data-intensive AI platform, and its Feature Store, a warehouse for machine learning features.
Logical Clocks' vision is to simplify the process of refining data into intelligence at scale.
Continuous Intelligence
A design pattern in which real-time analytics are integrated within a business operation,
processing current and historical data to prescribe actions in response to events.
Business
Tech
https://www.gartner.com/en/newsroom/press-releases/2019-02-18-gartner-identifies-top-10-data-and-analytics-technolo
events actions
Paradigm Shift in Data Processing
Before: Data, lots of Queries → retrospective answers
After the paradigm shift: Query, lots of Data → real-time answers
• Data Stream Processing as a 24/7 execution paradigm
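The contrast above, many queries over stored data versus one long-running query over arriving data, can be sketched in a few lines of plain Python. This is only an illustration of the idea, not any stream processor's API; the function names are hypothetical.

```python
from typing import Iterable, Iterator, List


def retrospective_average(stored: List[float]) -> float:
    """Query over data at rest: one answer, computed after the data was collected."""
    return sum(stored) / len(stored)


def continuous_average(stream: Iterable[float]) -> Iterator[float]:
    """Long-running query over data in motion: a fresh answer after every event."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count
```

The continuous version keeps only a count and a running total as state, which is what lets it run 24/7 over an unbounded input.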
The Real-Time Analytics Stack
High Level Models: Stream SQL, CEP…
Compute: Flink, Beam, Kafka-Streams, Apex, Storm, Spark Streaming…
Storage: Kafka, Pub/Sub, Kinesis, Pravega…
Actors vs Streams
Actor Programming
• Low-Level Event-Based Programming
• Manual/External State
• Not Robust: Manual Fault Tolerance
• Not flexible scaling
vs
Data Stream Computing
• Declarative Programming
• State Managed by the system
• Robust: Built-in Fault Tolerance
• Scalable Deployments
[Diagram: services with logic and state (actors) vs a declarative program (streams)]
Apache Flink Foundations
commercial
deployments
• Top-level Apache Project
• #1 stream processor (2019)
• Production-Proof
• > 400 contributors
• 100s of deployments
Data Streams, Fault Tolerance,
Window Aggregation
Calcite
stream-SQL
influenced
Structure of a 24/7 Application
Event Logs
Historic
Data
Event Logs
Files
Applications/Services
Stream
Processing
State
Program Hierarchy in Flink (what each layer automates)
Dataflow Engine
• Fault Tolerance
• Scalability
• Monitoring/IO Management
Event Processing API: f(input, state, time)
• Dynamic program state
• Operations on out-of-order streams
DataStream API: window, map, filter etc.
• Higher-Order Streaming Functions
• Event Windowing (sessions, time etc.)
Domain-Specific APIs: SQL, CEP, Tables, ML
• Fully Declarative Programming
• Event Patterns, Relations etc.
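The event-processing layer is summarised above as f(input, state, time). A minimal sketch of that shape in plain Python follows. It is not Flink's API: the event tuple layout, the 60-second gap rule and the names `process_event` and `run` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical event shape: (key, value, event_time_seconds).
Event = Tuple[str, float, int]


def process_event(value: float, state: dict, time: int) -> List[str]:
    """f(input, state, time): update per-key state and emit zero or more outputs."""
    outputs = []
    last_time = state.get("last_time")
    # Use the event's own timestamp to detect a gap since the previous event of this key.
    if last_time is not None and time - last_time > 60:
        outputs.append(f"gap of {time - last_time}s, running total was {state['total']}")
        state["total"] = 0.0
    state["total"] = state.get("total", 0.0) + value
    state["last_time"] = time
    return outputs


def run(events: List[Event]) -> Dict[str, List[str]]:
    """Route each event to the state of its key, as a keyed stream would."""
    states: Dict[str, dict] = defaultdict(dict)
    emitted: Dict[str, List[str]] = defaultdict(list)
    for key, value, time in events:
        emitted[key].extend(process_event(value, states[key], time))
    return dict(emitted)
```

In a real engine the per-key `state` dictionary would be owned, checkpointed and rescaled by the system rather than held in a local variable.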
Declarative Streaming Examples
Average Tip per Hour with Stream SQL
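The query shown on this slide did not survive extraction. As a rough stand-in, the sketch below computes the same kind of result, an average tip per one-hour tumbling window, in plain Python. The input layout (event time in seconds, tip) and the function name are assumptions, and this is not the slide's SQL.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

HOUR_SECONDS = 3600


def average_tip_per_hour(rides: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Group (event_time_seconds, tip) pairs into 1-hour tumbling windows and average the tips."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for event_time, tip in rides:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = event_time - event_time % HOUR_SECONDS
        sums[window_start] += tip
        counts[window_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}
```

A stream SQL engine expresses the grouping declaratively and also decides when a window's result is final, which this batch-style loop does not model.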
Declarative Streaming Examples
Completed Taxi Rides within 120 min with Complex Event Processing
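The pattern on this slide is also lost, so here is a hedged sketch of what such a rule does: match a ride's START event followed by its END event within 120 minutes. It is plain Python rather than a CEP library, it assumes events arrive in event-time order, and the event layout and names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

MAX_DURATION_MIN = 120

# Hypothetical event shape: (ride_id, kind, event_time_minutes); kind is "START" or "END".
RideEvent = Tuple[int, str, int]


def completed_rides(events: Iterable[RideEvent]) -> List[Tuple[int, int]]:
    """Return (ride_id, duration_minutes) for rides whose END follows START within 120 min."""
    open_rides: Dict[int, int] = {}  # ride_id -> start time (partial matches)
    matches: List[Tuple[int, int]] = []
    for ride_id, kind, time in events:
        if kind == "START":
            open_rides[ride_id] = time
        elif kind == "END" and ride_id in open_rides:
            duration = time - open_rides.pop(ride_id)
            if 0 <= duration <= MAX_DURATION_MIN:
                matches.append((ride_id, duration))
    return matches
```

The `open_rides` dictionary plays the role of the partial-match state that a CEP engine would keep and expire automatically.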
Example Use Cases
Real-Time Analytics in Action
https://flink.apache.org/poweredby.html
https://www.flink-forward.org/
Marketplace - Dynamic Ride Pricing with Apache Flink (2018)
https://marketplace.uber.com/ Flink Forward 2018
Compute Location-Sensitive Trends in Rider Demand and Driver Availability
Input Streams
• supply
• demand (taxi orders)
• Trips
• Traffic
Geo-Sensitive Time-based Aggregations (million events per sec)
Prices
Output Decisions
• Pricing
• Dispatch
• Promotions
• Driver Positioning
Flink as an Anomaly-Detection Engine for the Cloud (2018)
• Activity-Based Threat Protection
• Behavioural model per cloud user
• Detect outliers/suspicious behavior
• Cross-reference suspicious users
• Alert Admins within seconds
"We needed a stateful and scalable stream processing framework. We tested everything (Azure ML/Streams, MS Orleans, Apache Storm/Samza/Spark/Ignite/Beam etc.) and chose Flink." - Yonatan Most & Avihai Berkovitz
https://www.slideshare.net/FlinkForward/flink-forward-berlin-2018-yonatan-most-avihai-berkovitz-anomaly-detection-engine-for-cloud-activities-using-flink
8 data clusters. many TB of state
30k events per second
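The slide mentions a behavioural model per cloud user and outlier detection but does not say which model is used. The sketch below is one simple stand-in: per-user running statistics (Welford's algorithm) with a z-score threshold. The threshold, the minimum sample count and all names are illustrative assumptions, not the presenters' method.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class RunningStats:
    """Welford's online mean/variance: the per-user 'behavioural model' in this sketch."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x: float) -> float:
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        # A constant history has zero spread; this sketch gives it no score.
        return 0.0 if std == 0 else (x - self.mean) / std


def detect_outliers(events: Iterable[Tuple[str, float]], threshold: float = 3.0,
                    min_samples: int = 5) -> List[Tuple[str, float]]:
    """Flag (user, value) pairs that deviate strongly from that user's own history."""
    models: Dict[str, RunningStats] = defaultdict(RunningStats)
    alerts = []
    for user, value in events:
        model = models[user]
        # Score against the history seen so far, then fold the event into the model.
        if model.n >= min_samples and abs(model.zscore(value)) > threshold:
            alerts.append((user, value))
        model.update(value)
    return alerts
```

The state is keyed by user and updated one event at a time, which is the property that makes this kind of model a fit for a stateful stream processor.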
Data Streaming at Mass Scale
https://data-artisans.com/blog/blink-flink-alibaba-search
• Biggest Retailer in the world.
• Entire Product Search, A/B Testing, User Recommendations and Analytics Services are powered by Blink (fork of Flink).
• 1000s of nodes actively in production.
Continuous Deep Analytics (CDA)
[Diagram labels: Data, Processing, Knowledge, Reasoning, Decision Making, ∞]
The goal of CDA
• Create a Big Data platform that can support complex real-time decisions based on massive live data.
Real-Time and Deep Analytics for Central & Edge Clouds
Our promise and vision
From Real-Time Analytics to Continuous Deep Analytics
The Continuous Deep Analytics Paradigm Shift
[Diagram: real-time analytics (Query, live data, real-time answers; online) and Deep Analytics (Historic Model, historic data; offline) combined in a CDA system (Live Model, all data, critical decision making)]
The Bigger Picture
Data Processing
Data Streams
• scalable, fault-tolerant analytics
• event-based business logic
• out-of-order computation
• dynamic relational tables (SQL)
• event pattern-matching (CEP)
but what about deeper analytics…
• tensors
• graph algorithms
• deep learning
• feature learning
• reinforcement learning
• …
Data Pipelines Today
• Many Frameworks/Frontends for different needs
• (ML Training & Serving, SQL, Streams, Tensors, Graphs)
[Diagram labels: relational operators (⋈, σθ, π), Streams, Feature Learning, Tensor Programming, Dynamic Graphs, AI, ML, RL, Simulation tasks, Reasoning, Feature Engineering, Model Serving]
The Problem & Solution
Problem
• Data analytics pipelines are built on diverse programming models with hard abstraction boundaries.
• Performance deteriorates from context switching, steep data movement costs and excessive type conversions.
Solution
• Raise the level of abstraction through an intermediate representation (IR): a programming language that can both express and reason about each of the programming models.
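To make the idea of a shared IR concrete, here is a toy sketch in which two hypothetical frontends, one stream-style and one relational-style, lower to the same small set of nodes and run on one executor. This is not Arc's syntax or design; every name below is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# A toy intermediate representation: every frontend lowers to these two node types.


@dataclass(frozen=True)
class Map:
    fn: Callable[[float], float]


@dataclass(frozen=True)
class Filter:
    pred: Callable[[float], bool]


Node = Union[Map, Filter]


def from_stream_api(scale: float) -> List[Node]:
    """Hypothetical streaming frontend: stream.map(x * scale)."""
    return [Map(lambda x: x * scale)]


def from_relational_api(minimum: float) -> List[Node]:
    """Hypothetical relational frontend: SELECT * WHERE x >= minimum."""
    return [Filter(lambda x: x >= minimum)]


def execute(plan: List[Node], data: List[float]) -> List[float]:
    """One shared executor runs the combined plan in a single pass over the data."""
    out = []
    for x in data:
        keep = True
        for node in plan:
            if isinstance(node, Map):
                x = node.fn(x)
            elif not node.pred(x):
                keep = False
                break
        if keep:
            out.append(x)
    return out
```

Because both frontends produce the same node types, the combined plan never converts data between systems, which is the cost the problem statement points at.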
The Arcon Vision
Tensors DataFrames DataStreams Graphs
Unified Declarative Programming
Shared Native Execution
Cross-Compile
Optimize
Generated code
The Arcon Architecture
Unified Analytics DSL
Arcon Runtime
Arc IR (Intermediate Representation)
Unified analytics DSL
• Host language-agnostic core
• Compositional
• First-class support for streams, tensors, relations
[Diagram: Core DSL (Data Streams, Linear Algebra, Relational Algebra; σθ, π, ⋈) → Translation → Arc IR]
The Arc Intermediate Representation
[Diagram: Stream Task, Graph Task and Tensor Task (λ1, λ2, λ3) are each translated to IR (λ1 IR, λ2 IR, λ3 IR) and combined as λ1 + λ2 + λ3]
Arcon
Arc (High Level IR)
Logical Dataflow IR
Arcon runner
Hardware
Arcon Compiler Pipeline
Dataflow optimizations
Compiler optimizations
Cross-domain optimizations
Rust based runner
Hardware accelerated
Dynamic task execution
CPU/GPU/FPGA
Local & distributed
Dynamic scaling
Arc: an IR for expressing and optimizing computations that combine streams, relations and linear algebra
Arcon: a general-purpose distributed runtime written in Rust
Arc IR
• A minimal yet feature-complete set of read/write-only types and expressions
Arc Optimisations
• Arc supports both compiler and dataflow optimisations
• Compiler: Loop unrolling, partial evaluation, ...
• Dataflow: Operator fusion, fission, reordering, specialization, ...
Performance
• Unoptimised: v + 1, v + 1, v + 1
• Inlined (Task with function): v + 1, v + 1, v + 1
• Fused: v + 1 + 1 + 1
• Partially Evaluated: v + 3
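The progression on this slide, from three separate v + 1 tasks to a single v + 3, can be reproduced with two small passes over a toy pipeline. The representation (a list of tasks, each a list of constants to add) is an assumption made for this sketch and is not Arc's IR.

```python
from typing import List

# Toy pipeline IR: each task is a list of constants to add to the input value.
# [[1], [1], [1]] means three separate map tasks, each computing v + 1.
Pipeline = List[List[int]]


def fuse(pipeline: Pipeline) -> Pipeline:
    """Operator fusion: merge consecutive map tasks into one task (v + 1 + 1 + 1)."""
    fused = [c for task in pipeline for c in task]
    return [fused] if fused else []


def partially_evaluate(pipeline: Pipeline) -> Pipeline:
    """Partial evaluation: fold the constants inside each task (v + 3)."""
    return [[sum(task)] for task in pipeline]


def run(pipeline: Pipeline, values: List[int]) -> List[int]:
    """Interpret the pipeline; one pass over the data per task."""
    for task in pipeline:
        values = [_apply(task, v) for v in values]
    return values


def _apply(task: List[int], v: int) -> int:
    for c in task:
        v += c
    return v
```

Fusion removes the intermediate passes between tasks, and partial evaluation then removes the repeated additions inside the fused task; both leave the result unchanged.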
Performance
[Chart: Unoptimised, Inlined, Fused, Partially Evaluated]
• 2 orders of magnitude faster
• 10M elements mapped 50 times on Apache Flink
• Arc can boost even existing frameworks
A Runtime Capable of Unified Analytics
Neptune: Scheduling Suspendable Tasks for Unified Stream/Batch Applications, SoCC 2019
Garefalakis, Karanasos, Pietzuch
[Diagram labels: Hadoop, Spark, Flink, Storm, Arcon]
Performance Matters
• Arc Optimizer: ~10x Speedup
• Shared Hardware Acceleration: ~10^2x Speedup
• Data Parallel Execution: ~10^3x Speedup
Thanks
• To the CDA and HOPS teams, and in general to the distributed computing group at KTH and RISE SICS
• Please Visit
• DC@KTH https://dcatkth.github.io/
• HOPS https://www.hops.io/
• LogicalClocks https://www.logicalclocks.com/
