ML + Streaming
at
FLINK FORWARD EUROPE 2019
Sherin Thomas
@doodlesmt
Stopping a
Phishing Attack
Hello Alex, I’m Tracy calling from Lyft
HQ. This month we’re awarding $200 to
all 4.7+ star drivers. Congratulations!
Hey Tracy, thanks!
Np! And because we see that you’re in
a ride, we’ll dispatch another driver so
you can park at a safe location….
….Alright your passenger will be taken
care of by another driver
Before we can credit you the award, we
just need to quickly verify your identity.
We’ll now send you a verification text.
Can you please tell us what those
numbers are…...
12345
Fingerprinting
Fraudulent
Behaviour
Request Ride
...
Driver Contact
Cancel Ride
…..
Something
User action sequence
Reference: Fingerprinting Fraudulent Behaviour
User action sequence
Red Flag
Request Ride
...
Driver Contact
Cancel Ride
…..
Something
User action sequence
SELECT
  user_id,
  TOP(2056, action, occurred_at) OVER (
    PARTITION BY user_id
    ORDER BY event_occurred_at
    RANGE BETWEEN INTERVAL '90' DAY PRECEDING AND CURRENT ROW
  ) AS client_action_sequence
FROM event_user_action
String of user action sequence
input to Convolutional Neural
Network
Temporally ordered
Historic context is also important
(large lookback)
Event time processing
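The points above (temporal ordering, a large lookback, event-time processing) can be illustrated with a minimal Python sketch. It assumes that `TOP(n, action, occurred_at)` keeps the n most recent actions inside the window, which the slides do not spell out; the function name `action_sequence` and the sample events are hypothetical.

```python
from datetime import datetime, timedelta

def action_sequence(events, user_id, now, lookback_days=90, limit=2056):
    """Return one user's actions in event-time order within the lookback window.

    events: iterable of (user_id, action, occurred_at) tuples in any arrival order.
    Assumption: when more than `limit` actions qualify, the most recent are kept.
    """
    cutoff = now - timedelta(days=lookback_days)
    in_window = [
        (occurred_at, action)
        for uid, action, occurred_at in events
        if uid == user_id and cutoff <= occurred_at <= now
    ]
    in_window.sort(key=lambda pair: pair[0])  # order by event time, not arrival time
    return [action for _, action in in_window[-limit:]]

now = datetime(2019, 10, 8)
events = [
    ("alex", "cancel_ride", now - timedelta(minutes=1)),
    ("alex", "request_ride", now - timedelta(minutes=10)),
    ("alex", "driver_contact", now - timedelta(minutes=5)),
    ("alex", "request_ride", now - timedelta(days=120)),  # outside the 90-day lookback
    ("tracy", "request_ride", now - timedelta(minutes=3)),
]
sequence = action_sequence(events, "alex", now)
print(sequence)  # ['request_ride', 'driver_contact', 'cancel_ride']
```

The events arrive out of order, yet the resulting sequence follows event time, which is the property the model input depends on.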
Make it very easy to
experiment with, build,
& deploy DS and ML
solutions
What Data Scientists care about
1. Model Development
2. Feature Engineering
3. Data Quality
4. Scheduling, Execution, Data Collection
5. Compute Resources
ML Workflow
Data Input → Data Prep → Modeling → Deployment
● DATA DISCOVERY
● NORMALIZE AND CLEAN UP DATA
● EXTRACT & TRANSFORM FEATURES
● LABEL DATA
● MAINTAIN EXTERNAL FEATURE SETS
● TRAIN MODELS
● EVALUATE AND OPTIMIZE
● DEPLOY
● MONITOR & VISUALIZE PERFORMANCE
Needs
● Low latency stateful operations on streaming data - Yay Flink!!
● Event time processing
● Fast feature development - SQL for the win!
● Fully managed - data discovery, compute resource provisioning
User Plane: Dryft UI
Data Plane: Kafka, DynamoDB, Druid, Hive
Control Plane: Query Analysis, Job Cluster, Data Discovery
Dryft! - Self Service Streaming Framework
Elastic Search
Declarative Job Definition
{
  "retention": {},
  "lookback": {},
  "stream": {
    "kinesis": "user_activities"
  },
  "features": {
    "user_activity_per_geohash": {
      "type": "int",
      "version": 1,
      "description": "user activities per geohash"
    }
  }
}
Job Config
SELECT
  geohash,
  COUNT(*) AS total_events,
  TUMBLE_END(rowtime, INTERVAL '1' HOUR)
FROM event_user_action
GROUP BY
  geohash,
  TUMBLE(rowtime, INTERVAL '1' HOUR)
Flink SQL
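To make the tumbling-window semantics of this query concrete, here is a small Python sketch of the same aggregation over an in-memory list. It is an illustration only, not how Flink executes the query; `tumbling_counts` and the sample geohashes are hypothetical.

```python
from collections import Counter

HOUR_MS = 60 * 60 * 1000

def tumbling_counts(events, window_ms=HOUR_MS):
    """Count events per (geohash, window_end) in fixed, non-overlapping windows.

    events: iterable of (geohash, rowtime_ms) pairs.
    window_end is the exclusive upper bound of the window, as with TUMBLE_END.
    """
    counts = Counter()
    for geohash, rowtime_ms in events:
        window_start = rowtime_ms - (rowtime_ms % window_ms)
        counts[(geohash, window_start + window_ms)] += 1
    return dict(counts)

MINUTE_MS = 60 * 1000
events = [
    ("9q8yy", 5 * MINUTE_MS),
    ("9q8yy", 50 * MINUTE_MS),
    ("9q8yy", 61 * MINUTE_MS),  # falls into the second hourly window
    ("9q5ct", 10 * MINUTE_MS),
]
counts = tumbling_counts(events)
print(counts)
```

Each event belongs to exactly one window, so the per-geohash counts never overlap between consecutive hours.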
Feature Fanout
[Diagram labels, in slide order: User Apps, Kinesis/Kafka, Sources, Kinesis, S3, Kinesis/Kafka, Sinks, DynamoDB, Hive]
Eating our own
dogfood
SELECT
-- this will be used in keyBy
CONCAT_WS('_', feature_name, version, id),
feature_data,
CONCAT_WS('_', feature_name, version)
AS feature_definition,
occurred_at
FROM features
Feature Fanout App - also uses Dryft
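The projection in the fanout query can be sketched in Python as follows. The column name `key` is an assumption (the query leaves the first column unaliased), and `concat_ws` and `fanout_row` are hypothetical helpers that mimic SQL `CONCAT_WS`, which skips NULL arguments.

```python
def concat_ws(separator, *parts):
    """Join parts with a separator, skipping None values as SQL CONCAT_WS does."""
    return separator.join(str(p) for p in parts if p is not None)

def fanout_row(feature):
    """Project one feature record into the columns selected by the fanout query."""
    return {
        # Assumed alias: this value is what the job would key the stream by.
        "key": concat_ws("_", feature["feature_name"], feature["version"], feature["id"]),
        "feature_data": feature["feature_data"],
        "feature_definition": concat_ws("_", feature["feature_name"], feature["version"]),
        "occurred_at": feature["occurred_at"],
    }

row = fanout_row({
    "feature_name": "user_activity_per_geohash",
    "version": 1,
    "id": "9q8yy",
    "feature_data": 42,
    "occurred_at": "2019-10-08T10:00:00Z",
})
print(row["key"])                 # user_activity_per_geohash_1_9q8yy
print(row["feature_definition"])  # user_activity_per_geohash_1
```

Keying by name, version and entity id keeps all updates for one feature value on the same key, while the shorter definition string identifies the feature independently of the entity.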
{
  "stream": {
    "kinesis": "feature_stream"
  },
  "sinks": {
    "feature_service_dynamodb": {
      "write_rate": 1000,
      "retry_count": 5
    },
    "statsd": {...}
  }
}
Deployment
Previously...
● Ran on AWS EC2 using custom deployment
● Separate autoscaling groups for JobManager and TaskManagers
● Instance provisioning done during deployment
● Multiple jobs (60+) running on the same cluster
● Multi-tenancy hell!
One Job Manager to Rule Them All
Future is ... Flink on Kubernetes
Managing Flink on Kubernetes - by Anand and Ketan
● Separate Flink cluster for each application
● Resource allocation per job - at job startup time
● Scales to 100s of Flink applications
● Automatic application updates
Introducing Flink-K8s-Operator
[Diagram: Custom Resource Descriptor, Flink Operator, and a Flink cluster with one JobManager (JM) and six TaskManagers (TM)]
Custom Resource Descriptor
apiVersion: flink.k8s.io/v1alpha
kind: FlinkApplication
metadata:
  name: flink-speeds-working-stats
  namespace: flink
spec:
  image: '100.dkr.ecr.us-east-1.amazonaws.com/abc:xyz'
  flinkJob:
    jarName: name.jar
    parallelism: 10
  taskManagerConfig:
    resources:
      limits:
        memory: 15Gi
        cpu: 4
    replicas: num_task_managers
    taskSlots: NUM_SLOTS_PER_TASK_MANAGER
    envConfig: {...}
● Custom resource represents Flink application
● Docker image contains all dependencies
● CRD modifications trigger update (includes parallelism and other Flink configuration properties)
Bootstrapping
SELECT
  passenger_id,
  COUNT(ride_id)
FROM event_ride_completed
GROUP BY
  passenger_id,
  HOP(rowtime, INTERVAL '1' HOUR, INTERVAL '30' DAY)
What is bootstrapping?
[Timeline: Historic Data (days -7 to -1), Current Time, Future Data (days 1 to 6)]
Read historic data to ‘bootstrap’ the program with 30 days’ worth of data. Now your program returns results on day 1. But what if the source does not have all 30 days’ worth of data?
Bootstrap with historic data
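A small Python sketch of why bootstrapping matters for a 30-day window: shortly after startup, a job that only sees new events undercounts until 30 days have passed. The dates, the ride lists and the helper `rides_in_window` are made up for illustration.

```python
from datetime import datetime, timedelta

def rides_in_window(ride_times, window_end, size=timedelta(days=30)):
    """Count rides whose event time falls in [window_end - size, window_end)."""
    window_start = window_end - size
    return sum(1 for t in ride_times if window_start <= t < window_end)

job_start = datetime(2019, 10, 1)
# Rides completed before the job started (only available from historic storage).
historic = [job_start - timedelta(days=d) for d in (25, 12, 3)]
# Rides observed on the stream after startup.
realtime = [job_start + timedelta(minutes=20)]

first_window_end = job_start + timedelta(hours=1)
without_bootstrap = rides_in_window(realtime, first_window_end)
with_bootstrap = rides_in_window(historic + realtime, first_window_end)
print(without_bootstrap, with_bootstrap)  # 1 4
```

Without the historic rides the first result reports 1 ride instead of 4, and the count would stay incomplete until the job had been running for the full window length.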
Read historic data from a persistent store (AWS S3) and streaming data from Kafka/Kinesis
Solution - Consume from two sources
Bootstrapping state in Apache Flink - Hadoop Summit
[Diagram: a historic source (events < Target Time) and a real-time source (events >= Target Time) feed the Business Logic, which writes to the Sink]
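The split at the target time can be sketched in Python as below. This is a simplified, sequential illustration under the assumption that both sources may contain overlapping events; a real job reads both sources concurrently, and `bootstrap_union` is a hypothetical name.

```python
def bootstrap_union(historic, realtime, target_time):
    """Combine two sources so each event is taken from exactly one of them.

    Events are (event_time, payload) pairs.
    Historic events are kept only if event_time < target_time,
    real-time events only if event_time >= target_time.
    """
    for event in historic:
        if event[0] < target_time:
            yield event
    for event in realtime:
        if event[0] >= target_time:
            yield event

historic = [(1, "a"), (5, "b"), (10, "c")]
realtime = [(9, "x"), (10, "c"), (11, "d")]
merged = list(bootstrap_union(historic, realtime, target_time=10))
print(merged)  # [(1, 'a'), (5, 'b'), (10, 'c'), (11, 'd')]
```

Because the two filters are complementary, an event present in both sources (here the one at time 10) reaches the business logic only once.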
Timeline: Job starts → Bootstrapping over
Start Job
With a higher parallelism for fast bootstrapping
Detect Bootstrap Completion
Job sends a signal to the control plane once the watermark has progressed beyond the point where historic data is no longer needed
“Update” Job with lower parallelism but same job graph
Control plane cancels the job with a savepoint and starts it again from that savepoint, but with a much lower parallelism
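The lifecycle described here can be sketched as a tiny state machine in Python. Everything in it is hypothetical (the class names, the parallelism values, the integer watermarks); it only shows the order of decisions, not how the actual control plane is implemented.

```python
from dataclasses import dataclass, field

@dataclass
class JobState:
    parallelism: int
    bootstrapping: bool = True

@dataclass
class ControlPlane:
    """Hypothetical control plane that downsizes a job once bootstrapping is over."""
    bootstrap_parallelism: int
    steady_parallelism: int
    target_time: int
    log: list = field(default_factory=list)

    def __post_init__(self):
        # Start with a high parallelism so historic data is consumed quickly.
        self.job = JobState(parallelism=self.bootstrap_parallelism)

    def on_watermark(self, watermark):
        """Handle a watermark report from the job."""
        if self.job.bootstrapping and watermark >= self.target_time:
            # Historic data is no longer needed: same job graph, fewer resources.
            self.log.append("cancel with savepoint")
            self.job = JobState(parallelism=self.steady_parallelism, bootstrapping=False)
            self.log.append("restart from savepoint")

control_plane = ControlPlane(bootstrap_parallelism=64, steady_parallelism=8, target_time=100)
control_plane.on_watermark(50)    # still bootstrapping
control_plane.on_watermark(100)   # bootstrap complete, job is updated
control_plane.on_watermark(200)   # no further change
print(control_plane.job.parallelism, control_plane.log)
```

The update happens exactly once, when the watermark first reaches the target time.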
Bootstrapping
Output volume spike during bootstrapping
● Features need to be fresh and eventually complete
● Write features produced during bootstrapping separately
What about skew between historic and real-time data?
Skew
Watermark =
Solution: Source synchronization
[Diagram: two consumers, one reading partitions 1 and 2 and the other reading partitions 3 and 4, exchange a global watermark through shared state]
Reference: FLINK-10887, FLINK-10921, FLIP-27
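The idea of source synchronization can be illustrated with a short Python sketch: each consumer compares its local watermark with a shared global watermark and holds back when it runs too far ahead. The threshold `max_lead` and the function names are assumptions for illustration; the referenced tickets describe the actual mechanism.

```python
def global_watermark(local_watermarks):
    """The global watermark is the minimum of all consumers' local watermarks."""
    return min(local_watermarks.values())

def may_emit(consumer, local_watermarks, max_lead):
    """A consumer may emit while it is at most max_lead ahead of the global watermark."""
    lead = local_watermarks[consumer] - global_watermark(local_watermarks)
    return lead <= max_lead

# Shared state: the historic source lags far behind the real-time source.
watermarks = {"historic": 1_000, "realtime": 90_000}
print(may_emit("realtime", watermarks, max_lead=10_000))  # False: pause, too far ahead
print(may_emit("historic", watermarks, max_lead=10_000))  # True: the slowest keeps reading

watermarks["historic"] = 85_000  # the historic source catches up
print(may_emit("realtime", watermarks, max_lead=10_000))  # True: resume
```

Pausing the faster source bounds how much data downstream operators must buffer while they wait for the slower source to advance the watermark.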
Now...
● 60+ Flink jobs on Kubernetes powering 120+ features
● Features available in DynamoDB (real-time point lookup), Hive (offline analysis), Druid (real-time analysis) and more…
● Time to write, test and deploy a feature is less than half a day
Later
● Green/Blue deploy - zero downtime deploys
● “Auto” scaling of Flink cluster and/or job parallelism
● Feature Backfill - for consistent feature generation for model training and scoring
● Feature library
Thank you!

More Related Content

What's hot

Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward
 
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...Flink Forward
 
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...Flink Forward
 
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...Flink Forward
 
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...Flink Forward
 
Scaling stream data pipelines with Pravega and Apache Flink
Scaling stream data pipelines with Pravega and Apache FlinkScaling stream data pipelines with Pravega and Apache Flink
Scaling stream data pipelines with Pravega and Apache FlinkTill Rohrmann
 
Virtual Flink Forward 2020: Data Warehouse, Data Lakes, What's Next? - Xiaow...
Virtual Flink Forward 2020: Data Warehouse, Data Lakes, What's Next? -  Xiaow...Virtual Flink Forward 2020: Data Warehouse, Data Lakes, What's Next? -  Xiaow...
Virtual Flink Forward 2020: Data Warehouse, Data Lakes, What's Next? - Xiaow...Flink Forward
 
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"Flink Forward
 
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...Flink Forward
 
Virtual Flink Forward 2020: Machine learning with Flink in Weibo - Yu Qian
Virtual Flink Forward 2020: Machine learning with Flink in Weibo - Yu QianVirtual Flink Forward 2020: Machine learning with Flink in Weibo - Yu Qian
Virtual Flink Forward 2020: Machine learning with Flink in Weibo - Yu QianFlink Forward
 
Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Fl...
Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Fl...Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Fl...
Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Fl...Flink Forward
 
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...Flink Forward
 
A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH
A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH
A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH Flink Forward
 
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...Flink Forward
 
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward
 
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...Flink Forward
 
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...Flink Forward
 
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...Flink Forward
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupMingmin Chen
 

What's hot (20)

Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
Flink Forward San Francisco 2018: Ken Krugler - "Building a scalable focused ...
 
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...
 
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...Flink Forward San Francisco 2018: Andrew Gao &  Jeff Sharpe - "Finding Bad Ac...
Flink Forward San Francisco 2018: Andrew Gao & Jeff Sharpe - "Finding Bad Ac...
 
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...
 
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
 
Scaling stream data pipelines with Pravega and Apache Flink
Scaling stream data pipelines with Pravega and Apache FlinkScaling stream data pipelines with Pravega and Apache Flink
Scaling stream data pipelines with Pravega and Apache Flink
 
Virtual Flink Forward 2020: Data Warehouse, Data Lakes, What's Next? - Xiaow...
Virtual Flink Forward 2020: Data Warehouse, Data Lakes, What's Next? -  Xiaow...Virtual Flink Forward 2020: Data Warehouse, Data Lakes, What's Next? -  Xiaow...
Virtual Flink Forward 2020: Data Warehouse, Data Lakes, What's Next? - Xiaow...
 
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
Flink Forward Berlin 2018: Timo Walther - "Flink SQL in Action"
 
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...
 
Virtual Flink Forward 2020: Machine learning with Flink in Weibo - Yu Qian
Virtual Flink Forward 2020: Machine learning with Flink in Weibo - Yu QianVirtual Flink Forward 2020: Machine learning with Flink in Weibo - Yu Qian
Virtual Flink Forward 2020: Machine learning with Flink in Weibo - Yu Qian
 
Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Fl...
Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Fl...Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Fl...
Flink Forward Berlin 2018: Ravi Suhag & Sumanth Nakshatrithaya - "Managing Fl...
 
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
Flink Forward Berlin 2018: Brian Wolfe - "Upshot: distributed tracing using F...
 
A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH
A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH
A stream: Ad-hoc Shared Stream Processing - Jeyhun Karimov, DFKI GmbH
 
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...
 
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
Flink Forward San Francisco 2018: Xu Yang - "Alibaba’s common algorithm platf...
 
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
 
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...
 
Flink SQL in Action
Flink SQL in ActionFlink SQL in Action
Flink SQL in Action
 
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
Flink Forward Berlin 2018: Shriya Arora - "Taming large-state to join dataset...
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetup
 

Similar to Enabling Machine Learning with Apache Flink - Sherin Thomas, Lyft

Streaming SQL to unify batch and stream processing: Theory and practice with ...
Streaming SQL to unify batch and stream processing: Theory and practice with ...Streaming SQL to unify batch and stream processing: Theory and practice with ...
Streaming SQL to unify batch and stream processing: Theory and practice with ...Fabian Hueske
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSSmartNews, Inc.
 
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy NguyenGrokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy NguyenHuy Nguyen
 
Serverless in production, an experience report (London DevOps)
Serverless in production, an experience report (London DevOps)Serverless in production, an experience report (London DevOps)
Serverless in production, an experience report (London DevOps)Yan Cui
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Bowen Li
 
From training to explainability via git ops
From training to explainability via git opsFrom training to explainability via git ops
From training to explainability via git opsRyan Dawson
 
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardwareSomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardwarealistair_hann
 
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...confluent
 
Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appNeil Avery
 
Working with data using Azure Functions.pdf
Working with data using Azure Functions.pdfWorking with data using Azure Functions.pdf
Working with data using Azure Functions.pdfStephanie Locke
 
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션Amazon Web Services Korea
 
Streaming sql w kafka and flink
Streaming sql w  kafka and flinkStreaming sql w  kafka and flink
Streaming sql w kafka and flinkKenny Gorman
 
I Know It Was MEAN, But I Cut the Cord to LAMP Anyway
I Know It Was MEAN, But I Cut the Cord to LAMP AnywayI Know It Was MEAN, But I Cut the Cord to LAMP Anyway
I Know It Was MEAN, But I Cut the Cord to LAMP AnywayPOSSCON
 
Serverless in-action
Serverless in-actionServerless in-action
Serverless in-actionAssaf Gannon
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleJim Dowling
 
I Know It Was MEAN, But I Cut the Cord to LAMP Anyway
I Know It Was MEAN, But I Cut the Cord to LAMP AnywayI Know It Was MEAN, But I Cut the Cord to LAMP Anyway
I Know It Was MEAN, But I Cut the Cord to LAMP AnywayAll Things Open
 
Machine Learning with Microsoft Azure
Machine Learning with Microsoft AzureMachine Learning with Microsoft Azure
Machine Learning with Microsoft AzureDmitry Petukhov
 
IOOF IT System Modernisation
IOOF IT System ModernisationIOOF IT System Modernisation
IOOF IT System ModernisationMongoDB
 
Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Amazon Web Services
 

Similar to Enabling Machine Learning with Apache Flink - Sherin Thomas, Lyft (20)

Streaming SQL to unify batch and stream processing: Theory and practice with ...
Streaming SQL to unify batch and stream processing: Theory and practice with ...Streaming SQL to unify batch and stream processing: Theory and practice with ...
Streaming SQL to unify batch and stream processing: Theory and practice with ...
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWS
 
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy NguyenGrokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
 
Serverless in production, an experience report (London DevOps)
Serverless in production, an experience report (London DevOps)Serverless in production, an experience report (London DevOps)
Serverless in production, an experience report (London DevOps)
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
 
From training to explainability via git ops
From training to explainability via git opsFrom training to explainability via git ops
From training to explainability via git ops
 
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardwareSomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
SomeSQL at Skyscanner - Scaling in a changing world of databases and hardware
 
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...
 
Kakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming appKakfa summit london 2019 - the art of the event-streaming app
Kakfa summit london 2019 - the art of the event-streaming app
 
Working with data using Azure Functions.pdf
Working with data using Azure Functions.pdfWorking with data using Azure Functions.pdf
Working with data using Azure Functions.pdf
 
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
 
Streaming sql w kafka and flink
Streaming sql w  kafka and flinkStreaming sql w  kafka and flink
Streaming sql w kafka and flink
 
I Know It Was MEAN, But I Cut the Cord to LAMP Anyway
I Know It Was MEAN, But I Cut the Cord to LAMP AnywayI Know It Was MEAN, But I Cut the Cord to LAMP Anyway
I Know It Was MEAN, But I Cut the Cord to LAMP Anyway
 
Serverless in-action
Serverless in-actionServerless in-action
Serverless in-action
 
Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 
I Know It Was MEAN, But I Cut the Cord to LAMP Anyway
I Know It Was MEAN, But I Cut the Cord to LAMP AnywayI Know It Was MEAN, But I Cut the Cord to LAMP Anyway
I Know It Was MEAN, But I Cut the Cord to LAMP Anyway
 
Machine Learning with Microsoft Azure
Machine Learning with Microsoft AzureMachine Learning with Microsoft Azure
Machine Learning with Microsoft Azure
 
IOOF IT System Modernisation
IOOF IT System ModernisationIOOF IT System Modernisation
IOOF IT System Modernisation
 
Introduction to AWS Step Functions:
Introduction to AWS Step Functions: Introduction to AWS Step Functions:
Introduction to AWS Step Functions:
 
Async
AsyncAsync
Async
 

More from Flink Forward

Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
 
One sink to rule them all: Introducing the new Async Sink
One sink to rule them all: Introducing the new Async SinkOne sink to rule them all: Introducing the new Async Sink

Enabling Machine Learning with Apache Flink - Sherin Thomas, Lyft

  • 1. ML + Streaming at FLINK FORWARD EUROPE 2019 Sherin Thomas @doodlesmt
  • 2.
  • 4. Hello Alex, I’m Tracy calling from Lyft HQ. This month we’re awarding $200 to all 4.7+ star drivers. Congratulations! Hey Tracy, thanks! Np! And because we see that you’re in a ride, we’ll dispatch another driver so you can park at a safe location…. ….Alright your passenger will be taken care of by another driver
  • 5. Before we can credit you the award, we just need to quickly verify your identity. We’ll now send you a verification text. Can you please tell us what those numbers are…... 12345
  • 7. Request Ride ... Driver Contact Cancel Ride ….. Something User action sequence Reference: Fingerprinting Fraudulent Behaviour
  • 8. Reference: Fingerprinting Fraudulent Behaviour User action sequence Red Flag Request Ride ... Driver Contact Cancel Ride ….. Something
  • 9. User action sequence SELECT user_id, TOP(2056, action, occurred_at) OVER ( PARTITION BY user_id ORDER BY event_occurred_at RANGE INTERVAL '90' DAYS PRECEDING ) AS client_action_sequence FROM event_user_action
  • 10. User action sequence (same query as slide 9). Callout: String of user action sequence, input to a Convolutional Neural Network
  • 11. User action sequence (same query as slide 9). Callout: Temporally ordered
  • 12. User action sequence (same query as slide 9). Callout: Historic context is also important (large lookback)
  • 13. User action sequence (same query as slide 9). Callout: Event time processing
  • 14. Make it very easy to experiment with, build, & deploy DS and ML solutions
  • 15.
  • 16. What Data Scientists care about: 1 Model Development, 2 Feature Engineering, 3 Data Quality, 4 Scheduling, Execution, Data Collection, 5 Compute Resources
  • 17. ML Workflow. Stages: Data Input, Data Prep, Modeling, Deployment. Steps: data discovery; normalize and clean up data; extract & transform features; label data; maintain external feature sets; train models; evaluate and optimize; deploy; monitor & visualize performance
  • 18. ML Workflow (same diagram as slide 17)
  • 19. Needs ● Low latency stateful operations on streaming data - Yay Flink!! ● Event time processing ● Fast feature development - SQL for the win! ● Fully managed - data discovery, compute resource provisioning
  • 20. Dryft! - Self Service Streaming Framework. User Plane: Dryft UI. Data Plane: Kafka, DynamoDB, Druid, Hive. Control Plane: Query Analysis, Job Cluster, Data Discovery. Also shown: Elastic Search
  • 21. Declarative Job Definition. Job Config: { "retention": {}, "lookback": {}, "stream": { "kinesis": "user_activities" }, "features": { "user_activity_per_geohash": { "type": "int", "version": 1, "description": "user activities per geohash" } } } Flink SQL: SELECT geohash, COUNT(*) AS total_events, TUMBLE_END( rowtime, INTERVAL '1' HOUR ) FROM event_user_action GROUP BY geohash, TUMBLE( rowtime, INTERVAL '1' HOUR )
  • 24. Feature Fanout App - also uses Dryft. Job Config: { "stream": { "kinesis": "feature_stream" }, "sinks": { "feature_service_dynamodb": { "write_rate": 1000, "retry_count": 5 }, "statsd": {...} } } Flink SQL: SELECT CONCAT_WS('_', feature_name, version, id) /* this will be used in keyBy */, feature_data, CONCAT_WS('_', feature_name, version) AS feature_definition, occurred_at FROM features
  • 25.
  • 27. Previously... ● Ran on AWS EC2 using custom deployment ● Separate autoscaling groups for JobManager and TaskManagers ● Instance provisioning done during deployment ● Multiple jobs (60+) running on the same cluster ● Multi-tenancy hell!
  • 28. One Job Manager to Rule Them All
  • 29. Future is ... Flink on Kubernetes Managing Flink on Kubernetes - by Anand and Ketan ● Separate Flink cluster for each application ● Resource allocation per job - at job startup time ● Scales to 100s of Flink applications ● Automatic application updates
  • 30. Introducing Flink-K8s-Operator (reference: Managing Flink on Kubernetes - by Anand and Ketan). Diagram: a Custom Resource Descriptor is handed to the Flink Operator, which manages one JobManager (JM) and six TaskManagers (TM)
  • 31. Custom Resource Descriptor apiVersion: flink.k8s.io/v1alpha kind: FlinkApplication metadata: name: flink-speeds-working-stats namespace: flink spec: image: '100.dkr.ecr.us-east-1.amazonaws.com/abc:xyz' flinkJob: jarName: name.jar parallelism: 10 taskManagerConfig: { resources: { limits: { memory: 15Gi, cpu: 4 } }, replicas: num_task_managers, taskSlots: NUM_SLOTS_PER_TASK_MANAGER, envConfig: {...}, } ● Custom resource represents Flink application ● Docker image contains all dependencies ● CRD modifications trigger update (includes parallelism and other Flink configuration properties)
  • 34. Bootstrap with historic data. Timeline: Historic Data (-7 to -1), Current Time, Future Data (1 to 6). Read historic data to 'bootstrap' the program with 30 days' worth of data. Now your program returns results on day 1. But what if the source does not have all 30 days' worth of data?
  • 35. Solution - Consume from two sources. Read historic data from a persistent store (AWS S3) and streaming data from Kafka/Kinesis. Diagram: the historic source (events < Target Time) and the real-time source (events >= Target Time) both feed the Business Logic, which writes to the Sink. Reference: Bootstrapping state in Apache Flink - Hadoop Summit
  • 36.
  • 39. Start Job: with a higher parallelism for fast bootstrapping. Detect Bootstrap Completion: job sends a signal to the control plane once the watermark has progressed beyond a point where we no longer need historic data. "Update" Job with lower parallelism but same job graph: control plane cancels the job with a savepoint and starts it again from that savepoint, but with a much lower parallelism
  • 40. Output volume spike during bootstrapping Bootstrapping
  • 41. Output volume spike during bootstrapping ● Features need to be fresh and eventually complete ● Write features produced during bootstrapping separately
  • 42. What about skew between historic and real-time data?
  • 44. Solution: Source synchronization. Diagram: two consumers, one reading partitions 1 and 2 and the other reading partitions 3 and 4, exchange a global watermark through shared state. Reference: FLINK-10887, FLINK-10921, FLIP-27
  • 46. ● 60+ Flink jobs on Kubernetes powering 120+ features ● Features available in DynamoDB (real-time point lookup), Hive (offline analysis), Druid (real-time analysis) and more… ● Time to write, test and deploy a feature is less than half a day
  • 47. Later
  • 48. ● Green/Blue deploy - zero downtime deploys ● “Auto” scaling of Flink cluster and/or job parallelism ● Feature Backfill - for consistent feature generation for model training and scoring ● Feature library
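
The user action sequence query on slides 9 to 13 can be illustrated outside Flink with a small Python sketch. It is not the Dryft implementation: `TOP(...)` appears to be a custom function whose exact semantics the slides do not state, so the sketch assumes it keeps the most recent actions up to the given cap. The function name `action_sequences` and the tuple layout are hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

LOOKBACK = timedelta(days=90)  # mirrors RANGE INTERVAL '90' DAYS PRECEDING
MAX_ACTIONS = 2056             # mirrors the first argument of TOP(...)


def action_sequences(events, lookback=LOOKBACK, max_actions=MAX_ACTIONS):
    """Yield (user_id, occurred_at, sequence) for every event.

    events: iterable of (user_id, action, occurred_at) tuples, in any
    arrival order. They are sorted by event time first, which stands in
    for event-time processing in a real streaming job.
    """
    history = defaultdict(deque)  # user_id -> deque of (occurred_at, action)
    for user_id, action, occurred_at in sorted(events, key=lambda e: e[2]):
        window = history[user_id]  # PARTITION BY user_id
        window.append((occurred_at, action))
        # Drop actions that fell out of the lookback range.
        while window[0][0] < occurred_at - lookback:
            window.popleft()
        # Assumption: keep only the most recent max_actions entries.
        while len(window) > max_actions:
            window.popleft()
        yield user_id, occurred_at, [a for _, a in window]


if __name__ == "__main__":
    t0 = datetime(2019, 10, 1)
    demo = [
        ("u1", "cancel_ride", t0 + timedelta(hours=2)),
        ("u1", "request_ride", t0),
        ("u1", "driver_contact", t0 + timedelta(hours=1)),
    ]
    for row in action_sequences(demo):
        print(row)
```

Sorting by `occurred_at` before building the sequence is what makes the result temporally ordered even when events arrive out of order, which is the point the slides make about event time.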

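The bootstrapping approach on slides 35 and 39 (historic events before a target time, real-time events from the target time onward, and a signal once historic data is no longer needed) can be sketched as follows. This is a simplified, single-process illustration, not Flink code. It assumes both inputs are already sorted by event time, and it uses the event time of the merged stream as a stand-in for the watermark. The names `bootstrap_stream`, `run` and `on_bootstrap_done` are hypothetical.

```python
import heapq


def bootstrap_stream(historic, realtime, target_time):
    """Merge two event-time-sorted inputs of (event_time, payload) tuples.

    Historic events are used only before target_time and real-time events
    only from target_time onward, so no event is counted twice.
    """
    old = (e for e in historic if e[0] < target_time)
    new = (e for e in realtime if e[0] >= target_time)
    return heapq.merge(old, new, key=lambda e: e[0])


def run(historic, realtime, target_time, on_bootstrap_done):
    """Yield merged events and fire on_bootstrap_done exactly once.

    The callback stands in for the signal sent to the control plane, which
    could then restart the job from a savepoint with lower parallelism.
    """
    done = False
    for event_time, payload in bootstrap_stream(historic, realtime, target_time):
        if not done and event_time >= target_time:
            done = True
            on_bootstrap_done(event_time)
        yield event_time, payload


if __name__ == "__main__":
    historic = [(1, "a"), (2, "b"), (5, "also in real-time source")]
    realtime = [(4, "also in historic source"), (5, "c"), (6, "d")]
    for event in run(historic, realtime, 5, lambda t: print("bootstrap done at", t)):
        print(event)
```

Splitting strictly at the target time is the design choice that keeps the two sources from overlapping; anything the real-time source replays from before that point is ignored.
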
Editor's Notes

  1. T: 1
  2. Lyft is a ride sharing company. Our mission is to improve people's lives through the world's best transportation. ML models are used for making predictions about ETA, pricing, supply/demand etc., detecting fraud, sending notifications about delays etc. Since these are affected by constantly changing parameters like traffic or weather conditions, it is imperative for our models to draw insights from the current state of the world. To enable this, events and data produced on the platform are ingested by our event ingestion pipeline, processed and persisted, after which they are ready to be used for generating features for Machine Learning, as well as for other purposes.
  3. This has been discussed here: https://eng.lyft.com/fingerprinting-fraudulent-behavior-6663d0264fad Talking points: a streaming feature that can bootstrap with a large lookback; it can keep track of any number of actions in a sequence (256, 2056); "event time" is very important in order to get this right; the sequence must be available at query time.
  4. Same as note 3.
  5. Clustering of Driver contact and cancel ride near “request ride” event is a red flag.
  6. Neural network alchemy: As with most applications of deep learning, our approach was largely empirical. To find the best performing model, we took heavy inspiration from surveys and recent papers on deep learning techniques for NLP and searched over thousands of neural network architectures using automated cross-validation jobs. The search consisted of small changes such as specific activation functions and embedding dimensions to larger ones such as the order of the network layers and specific architectures proposed. In the end, we settled on a neural network that maps user actions to feature embeddings and a convolutional-recurrent architecture with an attention mechanism. Fancy. 😎
  7. Same as note 6.
  8. Same as note 6.
  9. Same as note 6.
  10. Same as note 6.
  11. Fast feature development: Flink SQL for the win! Consistent feature generation: write once and generate both offline training data and real-time scoring data. Fully managed self service: data discovery, compute resource provisioning. Fast bootstrapping. Reliable.
  12. Et voilà: your output starts getting produced immediately and you can visualize it in Kepler. The Kepler viz is used to visualize geographic data through Superset. This is possible because features are written to Druid and Hive, which can be queried through Superset. Other types of viz are also available.
  13. Suppose we want to keep a count of the number of rides taken by a user in the last 7 days, and report this every hour. If we were to start running this program today, on real-time data, what do you think will happen? It will take 7 days for this program to process enough data to be able to answer this query correctly. Waiting is the hardest part: what if, at the end of that wait, we realize that this logic needs to change? Do we wait just as long again before verifying results?
  14. This is where bootstrapping comes in. Instead of just processing real-time data when the program starts, we look back in time and use historic data to build up our state, and we continue updating the state as new real-time events come in. Now our program can answer queries on day 1.
  15. Same as note 14.
  16. Spiky output volume can make it challenging to achieve high throughput during bootstrapping.
  17. We write feature outputs produced during bootstrapping to a different Kinesis stream. This data is written to the sink separately, since we only need it in order to achieve eventual completeness.
  18. Since we keep getting events for old windows, we have to keep all the data around in memory. This could cause slow checkpoint times (lots of state to save) and could eventually cause us to run out of memory. It can get really bad if we're joining between different data sources, like Kinesis and S3 as in the bootstrapping case.
  19. Each consumer instance has a set of threads that each read data from a partition and send it out to the rest of the system. We introduce a fixed-size priority queue inside each consumer thread; the queues are read by a single main thread for the instance. The main thread scans the queues and pulls the oldest element. If a partition is out of sync, it will fill up its queue and stop being consumed from until the other partitions have caught up, at which point consumption resumes. This bounds the amount of too-new data we might read from our sources.
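
The source synchronization idea in note 19 can be sketched in a few lines of Python. This is a single-threaded simulation under stated assumptions, not the Flink consumer code: each partition is modeled as an iterable of timestamps in ascending order, plain bounded deques replace the per-thread priority queues, and the names `synchronized_read` and `queue_size` are hypothetical.

```python
from collections import deque


def synchronized_read(partitions, queue_size=2):
    """Yield (partition_id, timestamp) pairs, oldest first across partitions.

    partitions: dict mapping partition_id to an iterable of timestamps in
    ascending order. Each partition gets a bounded queue; a partition whose
    queue is full is not read from until the main loop drains it.
    """
    sources = {pid: iter(events) for pid, events in partitions.items()}
    queues = {pid: deque() for pid in partitions}

    def refill(pid):
        # Read ahead only while there is room, which bounds how much
        # too-new data a fast partition can buffer.
        queue = queues[pid]
        while len(queue) < queue_size:
            try:
                queue.append(next(sources[pid]))
            except StopIteration:
                break

    for pid in queues:
        refill(pid)

    while any(queues.values()):
        # The main loop scans the queue heads and pulls the oldest element.
        pid = min((p for p, q in queues.items() if q), key=lambda p: queues[p][0])
        yield pid, queues[pid].popleft()
        refill(pid)


if __name__ == "__main__":
    demo = {"partition 1": [1, 2, 3, 10], "partition 2": [4, 5, 6]}
    for pid, ts in synchronized_read(demo):
        print(pid, ts)
```

In this sketch a partition that is ahead in time simply sits with a full queue while older elements from other partitions are drained, so at most `queue_size` elements per partition are ever held in memory.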