Orchestrating
big data and ML
pipelines at Lyft
Lyft Talks #4
MODERATOR
Olexii Verkunich
Technical Recruiter
overkunich@lyft.com
Join the Ride!
AND GET $5K SIGN-ON
BONUS
(Applicable for ENG openings in Minsk only.
Must apply and/or get a job offer between
Nov 18 and Jan 30 to qualify)
#PrimerosPasos
Constantine Slisenka
SWE (Software Engineer)
cslisenka@lyft.com
Agenda
1. Use-cases: big data, ML
2. Big data at Lyft: scale, ecosystem
3. Orchestration: Airflow, Flyte
4. Flyte live demo, Q&A
Use-cases
#big data
#ml
Keeping map data accurate and fresh
We have high quality curated map
data from various sources including
open data like OpenStreetMap
We improve the map's accuracy by
processing imagery and GPS telemetry
to recognize objects on the road such as
road signs, closures, and traffic cones
The impact is more optimal route
calculation, better ETA, and better trip
cost estimation
Calculating and suggesting pick up spots
We have previous pick up history
We recommend the best nearby
pickup options to users
The impact is a better user experience
with less driver friction: pick up in the
optimal locations, more rides due to
fewer cancellations
Detecting missing and inaccurate destinations
We have anomalies in ride data
We detect when our data shows
patterns suggesting inaccurate
destinations
The impact is a better user experience:
users can find their destinations
effectively, which leads to more rides
Route calculation, ETA/price estimation
We have rich map data and
telemetry from driver mobile
devices
We generate real-time speed
profiles and build a
probabilistic model of routes
The impact is better routes,
ETA and price estimates
Forecasting of traffic, demand, and supply
We have driver location data,
information about events, and ride
history
We forecast demand and supply
and gain an understanding of market
balance
The impact is efficient pricing to
serve more rides, more informed
decisions around which incentives
to give drivers (e.g. bonus zones)
Big data at Lyft
#ecosystem
#scale
infrastructure compute engines stream processing
development, reporting orchestration, ETL
storage and metadata
50PB data in S3
376K jobs last month
5M containers last month
+400GB last month
ETL: 650K pipeline runs, 24M task executions
8 773 789 938 145 analytical events last month
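To put the monthly totals in perspective, the short sketch below converts them into average per-second rates. It assumes a 30-day month (the slide does not say which month was measured), and the dictionary keys are illustrative labels rather than names from the slide.

```python
# Assumption: a 30-day month; the slide does not state the month length.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

# Monthly totals as shown on the slide.
monthly_totals = {
    "analytical events": 8_773_789_938_145,
    "task executions": 24_000_000,
    "pipeline runs": 650_000,
}

# Average rate per second for each metric.
per_second = {name: total / SECONDS_PER_MONTH for name, total in monthly_totals.items()}

for name, rate in per_second.items():
    print(f"{name}: {rate:,.2f} per second on average")
```

Under that assumption, the analytical events figure works out to roughly 3.4 million events per second on average.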
Orchestration
Orchestration engines at Lyft
- Run pipelines (scheduled and ad-hoc)
- Provide a Python DSL
- Provide integrations with third-party
systems (Hive, Presto, Spark, ...)
- Are not compute engines
- Good for batch execution
- Not for data streaming
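The idea that an orchestrator runs pipelines but is not itself a compute engine can be sketched in a few lines of plain Python. This is a toy illustration, not the Airflow or Flyte API: `run_pipeline`, the task names, and the `upstream` mapping are all hypothetical, and the tasks only record their names where a real pipeline would submit work to Hive, Presto, or Spark. It uses `graphlib` from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def run_pipeline(tasks: Dict[str, Callable[[], None]],
                 upstream: Dict[str, List[str]]) -> List[str]:
    """Run every task after all of its upstream tasks; return the run order."""
    # TopologicalSorter expects a mapping of node -> predecessors.
    graph = {name: upstream.get(name, []) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()  # the orchestrator only triggers the work
    return order


log: List[str] = []
# Hypothetical tasks: each one just records that it ran.
tasks = {
    "load": lambda: log.append("load"),
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
}
upstream = {"transform": ["extract"], "load": ["transform"]}

order = run_pipeline(tasks, upstream)
print(order)
```

The dependency graph, not the order in which tasks are declared, decides the execution order.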
What is the difference between Flyte and Airflow?
Why was Flyte created?
Why do we use both?
Should I use Flyte or Airflow for my project?
Airflow DAGs
starting from 1.10.0
- Quick and simple to start
- Many integrations (operators)
- Good support for sensor tasks
- Monolithic
- Fixed set of workers
- Does not manage infrastructure
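A sensor task, in the sense used above, waits for an external condition (for example, a table partition landing) before downstream work starts. The sketch below shows the polling pattern in plain Python. It is not Airflow's sensor API: `wait_for`, `partition_landed`, and the partition name are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def wait_for(poke: Callable[[], bool], poke_interval: float, timeout: float) -> bool:
    """Call `poke` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical condition: the partition "appears" on the third check.
partitions = set()
checks = {"count": 0}


def partition_landed() -> bool:
    checks["count"] += 1
    if checks["count"] == 3:
        partitions.add("2021-11-18")
    return "2021-11-18" in partitions


found = wait_for(partition_landed, poke_interval=0.01, timeout=1.0)
timed_out = wait_for(lambda: False, poke_interval=0.01, timeout=0.03)
print(found, timed_out)
```

A polling sensor occupies a worker while it waits, which is one reason an event-driven trigger can be preferable on ephemeral infrastructure.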
No multi tenancy
- No environment separation:
DEV, STAGE, PROD
- Impossible to set up
custom libraries and
dependencies per DAG
Limited functionality
- No versioning of DAGs
(no way to compare
outputs of version A vs B)
- No caching of task results
(Airflow is not data aware)
Monolithic scheduler
- Centralized scheduler
becomes a bottleneck
No resource management
- Heavy tasks may
overwhelm a worker
- Impossible to set resource
quotas per task like max
memory/CPU
TARS
- Airflow development environment
for testing and backfilling
- Kubernetes pod with ETL software,
libraries, and CLI tools
(TARS is also the robot from the movie Interstellar)
Good tool for classic ETLs
that use a standard set of operators and
orchestrate third-party systems, when
a custom environment and multi-tenancy
are not required
Nov 2016
Flyte V0 built for the ETA team at Lyft
Nov 2019
Flyte was open sourced at KubeCon!
flyte.org
Q2 2020
Spotify and Freenome join Flyte as collaborators
Q1 2021
Flyte was donated to LF AI & Data Foundation
Union.ai started
Q3 2021
15 collaborator organizations
100+ contributors
Spotify contributes flytekit-java
Multi tenancy
Workspace is organized into
projects
Projects are organized into
domains: development,
staging, production
Projects or individual tasks can have
different environments
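The project and domain structure can be pictured as a lookup keyed by (project, domain), where each entry carries its own environment. The sketch below is a plain-Python illustration and not Flyte's API: `Environment`, `set_environment`, the project names, the image names, and the memory limits are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# The three domains named on the slide.
DOMAINS = ("development", "staging", "production")


@dataclass(frozen=True)
class Environment:
    image: str          # hypothetical container image name
    memory_limit: str   # hypothetical per-task memory quota


environments: Dict[Tuple[str, str], Environment] = {}


def set_environment(project: str, domain: str, env: Environment) -> None:
    """Attach an environment to one (project, domain) pair."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    environments[(project, domain)] = env


# Two hypothetical projects with different dependencies in the same domain.
set_environment("eta", "production", Environment("eta-pipelines:v12", "4Gi"))
set_environment("pricing", "production", Environment("pricing-ml:v3", "16Gi"))
# The same project can run a different image while under development.
set_environment("eta", "development", Environment("eta-pipelines:dev", "2Gi"))

print(environments[("eta", "production")])
```

Because the environment is attached to the tenant rather than to a shared worker, one team's libraries never have to be compatible with another's.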
Flyte workflows
Language agnostic: can be written in
Python and Java, any Docker image can
be a task
Versioned: each version is a separate
Docker image
Data aware: strong typing for inputs and
outputs, Flyte executes tasks based on
data dependencies, results can be cached
Flytekit (Flyte SDK)
operatorframework.io
- Task execution and resource
isolation are managed by Kubernetes
- Throttling and queueing are handled
by FlytePropeller
- Multi tenant
- Allows an isolated
environment per task or project
- Supports workflow versioning
- Data aware, can cache task results
Overhead
- Ephemeral infrastructure
brings a startup time
overhead
- Teams need to support
their Docker images
Anti patterns
- Table sensing is done much
more elegantly in Airflow (use an
event-driven approach)
- Not suited for complex
parallel computation
Good tool if you need a multi-tenant
environment, custom dependencies
per task or project, workflow versioning,
and compute isolation
Airflow
- Good for classic ETL jobs
- Quick and simple to start
- Good support for table sensing
- No multi tenancy
- No workflow versioning
- Monolithic, fixed set of workers
- No infrastructure and
environment isolation
Flyte
- Good for custom jobs (like ML)
- Multi tenant, API-friendly
- Supports workflow versioning
- Overhead with ephemeral
infrastructure and image
maintenance
- Infrastructure and environment
isolation based on Kubernetes
Choose the right tool for the right job*
* Some cases can be implemented well on both engines
Flyte
# live demo
Please ask questions in the
chat!
The best question gets
something special from
our speakers
#PrimerosPasos
Raffle Time!
Now Hiring in Minsk and Kyiv!
Backend, Data, ML Engineers
And many more on our careers page! (lyft.com/careers)
Join the Ride!
Connect with us!
LYFT.COM/CAREERS

Editor's Notes

  • #8 Basemap Collection (imagery): Yury Kunitski, Lev Dragunov. Map delivery: Thom Dedecko. There is lots of good detection imagery for closures here: https://docs.google.com/document/d/123rpa5nop5OtjLhq-YvUcF6V9jAYgqSu39ayYb7SQ0k/edit and you can find some systems overview stuff here, including an image of our fleetview camera: https://docs.google.com/presentation/d/11jWVR_EZ8Vgar2yBAwHmc3CwZLRbdmfjq20sycDFCno/edit#slide=id.g2d1a3323cd_0_58
  • #9 Journey Rendezvous Xiaomeng Chen
  • #10 Journey Data Andrey Kravtsov
  • #11 Mapping (Localization) Artsem Semianenka
  • #12 Market signals, forecasting Matthew Smith
  • #14 Market signals, forecasting Matthew Smith
  • #15 Market signals, forecasting Matthew Smith
  • #27 https://confluence.lyft.net/pages/viewpage.action?pageId=207326078
  • #32 https://blog.container-solutions.com/kubernetes-operators-explained ‘Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. Operators follow Kubernetes principles, notably the control loop’. https://kubernetes.io/docs/concepts/extend-kubernetes/operator/