In the world of ride sharing, many decisions, such as matching a passenger to the nearest driver, need to be made in real time. As a result, building and reacting to the current state of the world is imperative. In this presentation I will talk about how we built a machine learning platform for real-time feature generation at Lyft using Flink streaming. I will cover some of the problems we solved, such as dealing with skew when reading from multiple sources, controlling throughput when talking to microservices, bootstrapping, and other operational aspects of building a production-quality ML platform.
4. "Hello Alex, I'm Tracy calling from Lyft HQ. This month we're awarding $200 to all 4.7+ star drivers. Congratulations!"
"Hey Tracy, thanks!"
"No problem! And because we see that you're in a ride, we'll dispatch another driver so you can park at a safe location… Alright, your passenger will be taken care of by another driver."
5. "Before we can credit you the award, we just need to quickly verify your identity. We'll now send you a verification text. Can you please tell us what those numbers are…"
"12345"
9. User action sequence
SELECT
  user_id,
  TOP(2056, action, occurred_at) OVER (
    PARTITION BY user_id
    ORDER BY occurred_at
    RANGE INTERVAL '90' DAYS PRECEDING
  ) AS client_action_sequence
FROM event_user_action
10. User action sequence
SELECT
  user_id,
  TOP(2056, action, occurred_at) OVER (
    PARTITION BY user_id
    ORDER BY occurred_at
    RANGE INTERVAL '90' DAYS PRECEDING
  ) AS client_action_sequence
FROM event_user_action
String of user action sequence, input to a Convolutional Neural Network
11. User action sequence
SELECT
  user_id,
  TOP(2056, action, occurred_at) OVER (
    PARTITION BY user_id
    ORDER BY occurred_at
    RANGE INTERVAL '90' DAYS PRECEDING
  ) AS client_action_sequence
FROM event_user_action
Temporally ordered
12. User action sequence
SELECT
  user_id,
  TOP(2056, action, occurred_at) OVER (
    PARTITION BY user_id
    ORDER BY occurred_at
    RANGE INTERVAL '90' DAYS PRECEDING
  ) AS client_action_sequence
FROM event_user_action
Historic context is also important (large lookback)
13. User action sequence
SELECT
  user_id,
  TOP(2056, action, occurred_at) OVER (
    PARTITION BY user_id
    ORDER BY occurred_at
    RANGE INTERVAL '90' DAYS PRECEDING
  ) AS client_action_sequence
FROM event_user_action
Event time processing
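The "event time processing" called out above maps onto Flink's timestamp/watermark machinery: windows are keyed to when an action happened, not when it arrived. Below is a minimal DataStream-API sketch of that setup; the UserAction POJO, its field names, and the 5-minute out-of-orderness bound are illustrative assumptions, not the actual Dryft job:

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

public class EventTimeSketch {
    // Hypothetical POJO standing in for rows of event_user_action.
    public static class UserAction {
        public String userId;
        public String action;
        public long occurredAtMillis; // when the action happened on the client
    }

    public static DataStream<UserAction> withEventTime(DataStream<UserAction> raw) {
        // Order by when events happened (event time), not when they arrived,
        // tolerating up to 5 minutes of out-of-order delivery before the
        // watermark declares a point in time complete.
        return raw.assignTimestampsAndWatermarks(
            WatermarkStrategy
                .<UserAction>forBoundedOutOfOrderness(Duration.ofMinutes(5))
                .withTimestampAssigner((event, priorTimestamp) -> event.occurredAtMillis));
    }
}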
14. Make it very easy to experiment with, build, & deploy DS and ML solutions
16. What Data Scientists care about
1. Model Development
2. Feature Engineering
3. Data Quality
4. Scheduling, Execution, Data Collection
5. Compute Resources
17. ML Workflow
Data Input: Data Discovery
Data Prep: Normalize and Clean Up Data; Extract & Transform Features; Label Data; Maintain External Feature Sets
Modeling: Train Models; Evaluate and Optimize
Deployment: Deploy; Monitor & Visualize Performance
19. Needs
● Low latency stateful operations on streaming data - Yay Flink!!
● Event time processing
● Fast feature development - SQL for the win!
● Fully managed - data discovery, compute resource provisioning
20. Dryft! - Self Service Streaming Framework
User Plane: Dryft UI
Control Plane: Query Analysis, Job Cluster, Data Discovery
Data Plane: Kafka, DynamoDB, Druid, Hive, Elasticsearch
27. Previously...
● Ran on AWS EC2 using a custom deployment
● Separate autoscaling groups for the JobManager and TaskManagers
● Instance provisioning done during deployment
● Multiple jobs (60+) running on the same cluster
● Multi-tenancy hell!
29. The future is... Flink on Kubernetes
Managing Flink on Kubernetes - by Anand and Ketan
● Separate Flink cluster for each application
● Resource allocation per job - at job startup time
● Scales to 100s of Flink applications
● Automatic application updates
34. Bootstrap with historic data
[Timeline: events -7 through -1 sit before the current time (historic data); events 1 through 6 sit after it (future data).]
Read historic data to 'bootstrap' the program with 30 days' worth of data. Now your program returns results on day 1. But what if the source does not have all 30 days' worth of data?
35. Solution - Consume from two sources
Read historic data from a persistent store (AWS S3) and streaming data from Kafka/Kinesis.
[Diagram: an S3 source supplies historic events with timestamps < the target time, a Kafka/Kinesis source supplies real-time events with timestamps >= the target time, and both feed the same business logic and sink.]
Bootstrapping state in Apache Flink - Hadoop Summit
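A minimal sketch of that split, reusing the hypothetical UserAction POJO from the earlier event-time sketch; the two filter predicates mirror the "< target time" / ">= target time" routing in the diagram:

import org.apache.flink.streaming.api.datastream.DataStream;

public class BootstrapUnion {
    public static DataStream<EventTimeSketch.UserAction> bootstrapped(
            DataStream<EventTimeSketch.UserAction> historic,   // e.g. files read from S3
            DataStream<EventTimeSketch.UserAction> realtime,   // e.g. a Kafka/Kinesis consumer
            long targetTimeMillis) {
        // Historic events strictly before the target time...
        DataStream<EventTimeSketch.UserAction> past =
            historic.filter(e -> e.occurredAtMillis < targetTimeMillis);
        // ...and real-time events at or after it, so no event is read twice.
        DataStream<EventTimeSketch.UserAction> present =
            realtime.filter(e -> e.occurredAtMillis >= targetTimeMillis);
        // The union feeds a single business-logic operator and then the sink.
        return past.union(present);
    }
}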
39. Start Job - with a higher parallelism for fast bootstrapping.
Detect Bootstrap Completion - the job sends a signal to the control plane once the watermark has progressed beyond the point where we no longer need historic data.
"Update" Job - the control plane cancels the job with a savepoint and starts it again from that savepoint, with the same job graph but a much lower parallelism (sketched below).
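For concreteness, here is a hedged sketch of that control-plane step. The FlinkControlClient interface is hypothetical, not a real Flink class; the comments show the equivalent Flink CLI invocations (flink cancel -s, flink run -s -p) that existed in Flink versions of this era:

public class BootstrapRescaler {
    // Hypothetical wrapper around Flink's job-management API.
    interface FlinkControlClient {
        String cancelWithSavepoint(String jobId, String savepointDir); // returns savepoint path
        void submit(String jarPath, String savepointPath, int parallelism);
    }

    public void onBootstrapComplete(FlinkControlClient flink, String jobId, String jarPath) {
        // The job has signalled that its watermark passed the bootstrap target time.
        // Take a savepoint and cancel, e.g.:  flink cancel -s s3://savepoints/ <jobId>
        String savepoint = flink.cancelWithSavepoint(jobId, "s3://savepoints/");
        // Restart from the savepoint with the same job graph but lower parallelism,
        // since steady-state traffic is much smaller than the bootstrap firehose,
        // e.g.:  flink run -s <savepointPath> -p 4 job.jar
        flink.submit(jarPath, savepoint, 4);
    }
}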
41. Output volume spike during bootstrapping
● Features need to be fresh and eventually complete
● Write features produced during bootstrapping separately (see the side-output sketch below)
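One way to express that split in Flink is with side outputs: features whose windows end before the bootstrap target time go to a separate stream (per the notes later, a separate Kinesis stream), while fresh features flow to the main sink. The Feature POJO and its field names are illustrative assumptions:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class BootstrapSplit {
    // Hypothetical feature record produced by the windowed aggregation.
    public static class Feature {
        public String userId;
        public long windowEndMillis;
        public String value;
    }

    // Tag for features computed from historic data; they only matter for
    // eventual completeness, so they can take a slower write path.
    static final OutputTag<Feature> BOOTSTRAP = new OutputTag<Feature>("bootstrap") {};

    public static SingleOutputStreamOperator<Feature> split(
            DataStream<Feature> features, final long targetTimeMillis) {
        return features.process(new ProcessFunction<Feature, Feature>() {
            @Override
            public void processElement(Feature f, Context ctx, Collector<Feature> out) {
                if (f.windowEndMillis < targetTimeMillis) {
                    ctx.output(BOOTSTRAP, f); // spiky bootstrap-era output
                } else {
                    out.collect(f);           // fresh, steady-state output
                }
            }
        });
    }
}

The bootstrap-era stream is then retrieved with split(...).getSideOutput(BootstrapSplit.BOOTSTRAP) and attached to its own sink.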
46. ● 60+ Flink jobs on Kubernetes powering 120+ features
● Features available in DynamoDB (real-time point lookups), Hive (offline analysis), Druid (real-time analysis) and more…
● Time to write, test and deploy a feature is < 1/2 day
48. ● Green/blue deploys - zero-downtime deploys
● "Auto" scaling of the Flink cluster and/or job parallelism
● Feature backfill - for consistent feature generation for model training and scoring
● Feature library
Lyft is a ride sharing company. Our mission is to improve people's lives through the world's best transportation.
ML models are used for making predictions about ETA, pricing, and supply/demand, for detecting fraud, for sending notifications about delays, and more. Since these are affected by constantly changing parameters like traffic or weather conditions, it is imperative for our models to draw insights from the current state of the world.
To enable this, every event produced on the platform is ingested by our event ingestion pipeline, processed, and persisted, after which point it is ready to be used for generating features for machine learning, as well as for other purposes.
This has been discussed here: https://eng.lyft.com/fingerprinting-fraudulent-behavior-6663d0264fad
Talking points:
A streaming feature that can bootstrap with a large lookback.
Can keep track of action sequences of any length - 256, 2056.
"Event time" is very important in order to get this right.
The sequence must be available at query time.
Clustering of driver-contact and ride-cancel events near a "request ride" event is a red flag.
Neural network alchemy
As with most applications of deep learning, our approach was largely empirical. To find the best performing model, we took heavy inspiration from surveys and recent papers on deep learning techniques for NLP and searched over thousands of neural network architectures using automated cross-validation jobs. The search consisted of small changes such as specific activation functions and embedding dimensions to larger ones such as the order of the network layers and specific architectures proposed. In the end, we settled on a neural network that maps user actions to feature embeddings and a convolutional-recurrent architecture with an attention mechanism. Fancy. 😎
Fast feature development - Flink SQL for the win!
Consistent feature generation - write once and generate offline training data and real time scoring data
Fully managed
Self Service - data discovery, compute resource provisioning
Fast bootstrapping
Reliable
Et voilà - your output starts getting produced immediately and you can visualize it in Kepler.
Kepler visualizations display geographic data through Superset. This is enabled because features are written to Druid and Hive, which can be queried through Superset. Other types of visualization are also available.
Suppose we want to keep a count of the number of rides taken by a user in the last 7 days, and report this every 1 hour.
If we were to start running this program today, on realtime data, what do you think will happen?
It will take 7 days for this program to process enough data to be able to answer this query correctly.
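As a sketch of what such a feature could look like in code, here is a minimal sliding-window count in the DataStream API; the Ride type is hypothetical, and the real Dryft feature would be expressed in Flink SQL rather than Java:

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class RideCountFeature {
    // Hypothetical event type; assumes event-time timestamps and watermarks
    // have already been assigned upstream (see the earlier watermark sketch).
    public static class Ride { public String userId; public long occurredAtMillis; }

    static class Count implements AggregateFunction<Ride, Long, Long> {
        public Long createAccumulator() { return 0L; }
        public Long add(Ride r, Long acc) { return acc + 1; }
        public Long getResult(Long acc) { return acc; }
        public Long merge(Long a, Long b) { return a + b; }
    }

    public static DataStream<Long> ridesLast7Days(DataStream<Ride> rides) {
        return rides
            .keyBy(r -> r.userId)
            // 7-day windows sliding every hour: each emission answers
            // "how many rides did this user take in the last 7 days?"
            .window(SlidingEventTimeWindows.of(Time.days(7), Time.hours(1)))
            .aggregate(new Count());
    }
}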
Waiting is the hardest part: what if at the end of the 30 days we realize that this logic needs to change? Do we wait another 30 days before verifying results?
This is where bootstrapping comes in. Instead of processing only real-time data when the program starts, we look back in time and use historic data to build up our state, and we continue updating that state as new real-time events come in. Now our program can answer queries on day 1.
Spiky output volume can make it challenging to achieve high throughput during bootstrapping.
We write feature outputs produced during bootstrapping to a different Kinesis stream. This data is written to the sink separately, since we only need it in order to achieve eventual completeness.
Since we keep getting events for old windows, we have to keep all of that data around in memory.
This can cause slow checkpoint times (lots of state to save) and can eventually cause us to run out of memory.
It gets really bad if we are joining between different data sources, like Kinesis and S3, as in the bootstrapping case, because the sources can be skewed far apart in event time.
Each source has a set of threads, each of which reads data from one partition and sends it out to the rest of the system.
We introduce a fixed-size priority queue inside each consumer thread, read by a single thread for the instance.
The main thread scans the queues and always pulls the oldest element.
If a partition is out of sync (too far ahead), it fills up its queue and stops being consumed from until the other partitions have caught up, at which point consumption resumes.
This bounds the amount of too-new data we might read from our sources; a sketch follows.
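A minimal single-class sketch of that synchronization, under simplifying assumptions: it uses one bounded blocking FIFO queue per partition (the notes mention priority queues, which would additionally reorder events within a partition), and it omits the handling of empty-but-active partitions that a production version needs:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class SynchronizedReader {
    public interface Event { long eventTimeMillis(); }

    // One bounded queue per partition-consumer thread. A thread whose
    // partition runs ahead fills its queue and blocks on put(), which
    // pauses consumption of that partition until the others catch up.
    private final List<BlockingQueue<Event>> queues = new ArrayList<>();

    public SynchronizedReader(int partitions, int capacity) {
        for (int i = 0; i < partitions; i++) {
            queues.add(new ArrayBlockingQueue<>(capacity));
        }
    }

    // Called by the consumer thread for a partition; blocks when its queue is
    // full, which is exactly the back-pressure that keeps sources in sync.
    public void publish(int partition, Event e) throws InterruptedException {
        queues.get(partition).put(e);
    }

    // Called by the single emitter thread: scan all queue heads and emit the
    // oldest event, bounding how much too-new data enters the pipeline.
    public Event nextOldest() {
        int best = -1;
        long bestTs = Long.MAX_VALUE;
        for (int i = 0; i < queues.size(); i++) {
            Event head = queues.get(i).peek();
            if (head != null && head.eventTimeMillis() < bestTs) {
                bestTs = head.eventTimeMillis();
                best = i;
            }
        }
        // A production version must wait when an active partition's queue is
        // empty, since its next event might be the oldest; we skip that here.
        return best < 0 ? null : queues.get(best).poll();
    }
}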