Orchestrating
big data and ML
pipelines at Lyft
Lyft Talks #4
MODERATOR
Olexii Verkunich
Technical Recruiter
overkunich@lyft.com
Join the Ride!
AND GET $5K SIGN-ON
BONUS
(Applicable for ENG openings in Minsk only.
Must apply and/or get a job offer between
Nov 18 and Jan 30 to qualify)
#PrimerosPasos
Constantine Slisenka
SWE (Software Engineer)
cslisenka@lyft.com
Agenda
1. Use-cases: big data, ML
2. Big data at Lyft: scale, ecosystem
3. Orchestration: Airflow, Flyte
4. Flyte live demo, Q&A
Use-cases
#big data
#ml
Keeping map data accurate and fresh
We have high quality curated map
data from various sources including
open data like OpenStreetMap
We improve the map to make it more
accurate by processing imagery and
gps telemetry for recognizing objects
on the road like road signs, closures,
traffic cones
The impact is better route
calculation, better ETA, and better trip cost
estimation
Calculating and suggesting pick up spots
We have previous pick up history
We recommend the best nearby
pickup options to users
The impact is a better user experience
with less driver friction: pick up in the
optimal locations, more rides due to
fewer cancellations
Detecting missing and inaccurate destinations
We have anomalies in ride data
We detect when our data shows
patterns suggesting inaccurate
destinations
The impact is a better user experience,
users are able to effectively find their
destinations, more rides
Route calculation, ETA/price estimation
We have rich map data and
telemetry from driver mobile
devices
We generate real-time speed
profiles and build a
probabilistic model of routes
The impact is better routes,
ETA and price estimates
Forecasting of traffic, demand, and supply
We have driver location data,
information about events, rides
history
We forecast demand and supply
to understand market
balance
The impact is efficient pricing to
serve more rides, more informed
decisions around which incentives
to give drivers (e.g. bonus zones)
Big data at Lyft
#ecosystem
#scale
Ecosystem areas: infrastructure; compute engines; stream processing;
development, reporting; orchestration, ETL; storage and metadata
- 50PB data in S3
- 376K jobs last month
- 5M containers last month
- +400GB last month
- ETL: 650K pipeline runs, 24M task executions
- 8 773 789 938 145 analytical events last month
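To put the numbers above in perspective, they can be turned into average rates: roughly 3.4 million analytical events per second and about 37 task executions per pipeline run. This is a back-of-the-envelope sketch, not a figure from the talk. It assumes a 30-day month, treats the rounded values (650K, 24M) as exact, and assumes the pipeline runs and task executions cover the same period.

```python
# Figures quoted on the slide.
ANALYTICAL_EVENTS_LAST_MONTH = 8_773_789_938_145
PIPELINE_RUNS = 650_000          # rounded on the slide
TASK_EXECUTIONS = 24_000_000     # rounded on the slide

# Assumption: a 30-day month; the slide does not say which month.
DAYS_IN_MONTH = 30
SECONDS_IN_MONTH = DAYS_IN_MONTH * 24 * 60 * 60

events_per_second = ANALYTICAL_EVENTS_LAST_MONTH / SECONDS_IN_MONTH
tasks_per_run = TASK_EXECUTIONS / PIPELINE_RUNS

print(f"{events_per_second:,.0f} analytical events per second on average")
print(f"{tasks_per_run:.1f} task executions per pipeline run on average")
```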
Orchestration
- Run pipelines (scheduled and ad-hoc)
- Provide a Python DSL
- Provide integrations with third party
systems (hive, presto, spark, ...)
- Are not compute engines
- Good for batch execution
- Not for data streaming
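As a rough illustration of the list above, an orchestrator can be pictured as a scheduler that runs tasks in dependency order and leaves the heavy work to other systems. The sketch below is a toy in plain Python, not the Airflow or Flyte API. `Pipeline`, its `task` decorator, and the task names are all hypothetical.

```python
from graphlib import TopologicalSorter


class Pipeline:
    """Toy pipeline: named tasks plus their upstream dependencies."""

    def __init__(self, name):
        self.name = name
        self.tasks = {}       # task name -> callable
        self.upstream = {}    # task name -> set of upstream task names

    def task(self, *, after=()):
        """Decorator that registers a function as a task (hypothetical DSL)."""
        def register(fn):
            self.tasks[fn.__name__] = fn
            self.upstream[fn.__name__] = set(after)
            return fn
        return register

    def run(self):
        """Run every task once, upstream tasks first; return the order used."""
        order = list(TopologicalSorter(self.upstream).static_order())
        for task_name in order:
            self.tasks[task_name]()
        return order


pipeline = Pipeline("daily_rides_etl")
log = []


@pipeline.task()
def extract():
    # A real task would submit a query to Hive, Presto, or Spark here;
    # the orchestrator itself does not do the computation.
    log.append("extract")


@pipeline.task(after=["extract"])
def transform():
    log.append("transform")


@pipeline.task(after=["transform"])
def load():
    log.append("load")


run_order = pipeline.run()
print(run_order)
```

The same `run()` call could be triggered by a schedule or by hand, which is the difference between scheduled and ad-hoc runs.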
Orchestration engines at Lyft
What is the difference between Flyte and Airflow?
Why was Flyte created?
Why do we use both?
Should I use Flyte or Airflow for my project?
Airflow DAGs
starting from 1.10.0
- Quick and simple to start
- Many integrations (operators)
- Good support for sensor tasks
- Monolithic
- Fixed set of workers
- Does not manage infrastructure
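A sensor task, as mentioned in the list above, waits for an external condition (for example, a table partition landing) before downstream tasks run. A minimal sketch of the polling idea in plain Python follows. It is not Airflow's implementation, and `wait_for` and `partition_ready` are hypothetical names. The parameter names `poke_interval` and `timeout` are borrowed from Airflow's sensor vocabulary.

```python
import time


def wait_for(condition, poke_interval=0.01, timeout=1.0):
    """Poll `condition` until it returns True; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical stand-in for "has today's partition landed?"
checks = {"count": 0}


def partition_ready():
    checks["count"] += 1
    return checks["count"] >= 3   # becomes ready on the third check


ready = wait_for(partition_ready)
never_ready = wait_for(lambda: False, poke_interval=0.01, timeout=0.05)
```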
No multi tenancy
- No environment separation:
DEV, STAGE, PROD
- Impossible to set up
custom libraries and
dependencies per DAG
Limited functionality
- No versioning of DAGs
(no way to compare
outputs of version A vs B)
- No caching of task results
(Airflow is not data aware)
Monolithic scheduler
- Centralized scheduler
becomes a bottleneck
No resource management
- Heavy tasks may
overwhelm a worker
- Impossible to set resource
quotas per task like max
memory/CPU
TARS
- Airflow development environment
for testing and backfilling
- Kubernetes pod with ETL software
and libraries and CLI tools
(TARS is also the robot from the movie Interstellar)
Good tool for classic ETLs
using a standard set of operators and
orchestrating third-party systems when
custom environment and multi tenancy
are not required
Nov, 2016
Flyte V0 built for
ETA team at Lyft
Nov, 2019
Flyte was open sourced
at KubeCon!
flyte.org
Q2 2020
Spotify and Freenome join
Flyte as collaborators
Q1 2021
Flyte was donated to LF AI &
Data Foundation
Union.ai started
Q3 2021
15 collaborator
organizations
100+ contributors
Spotify contributes
flytekit-java
Multi tenancy
Workspace is organized into
projects
Projects are organized into
domains: development,
staging, production
Projects or individual tasks can have
different environments
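The project and domain hierarchy described above can be sketched as a registry keyed by project, domain, workflow name, and version. This is an illustration in plain Python, not Flyte's data model or API. `WorkflowId`, `Registry`, the image tags, and the rule that a registered version cannot be overwritten are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowId:
    project: str   # owning team or product area
    domain: str    # "development", "staging", or "production"
    name: str
    version: str


class Registry:
    """Toy registry mapping a workflow identity to the image it runs in."""

    DOMAINS = ("development", "staging", "production")

    def __init__(self):
        self._images = {}

    def register(self, workflow_id, image):
        if workflow_id.domain not in self.DOMAINS:
            raise ValueError(f"unknown domain: {workflow_id.domain}")
        if workflow_id in self._images:
            # Assumption of this sketch: a registered version is never overwritten.
            raise ValueError(f"already registered: {workflow_id}")
        self._images[workflow_id] = image

    def image_for(self, workflow_id):
        return self._images[workflow_id]


registry = Registry()
dev_v1 = WorkflowId("eta", "development", "train_model", "v1")
dev_v2 = WorkflowId("eta", "development", "train_model", "v2")
prod_v1 = WorkflowId("eta", "production", "train_model", "v1")

registry.register(dev_v1, "eta-train:v1")
registry.register(dev_v2, "eta-train:v2")
registry.register(prod_v1, "eta-train:v1")
```

Because the domain is part of the key, a new version can be tried in development while production keeps running the older one.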
Flyte workflows
Language agnostic: workflows can be written in
Python and Java; any Docker image can
be a task
Versioned: each version is a separate
Docker image
Data aware: strong typing for inputs and
outputs; Flyte executes tasks based on
data dependencies; results can be cached
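One way to picture "data aware" caching is a cache key built from the task name, its version, and its inputs: if none of them changed, the stored output is reused. The sketch below is plain Python and makes no claim about how Flyte computes its cache keys. `CachingRunner` and `average_speed` are hypothetical.

```python
import hashlib
import json


class CachingRunner:
    """Toy runner that reuses an output when task name, version, and inputs match."""

    def __init__(self):
        self.cache = {}
        self.executions = 0

    def run(self, fn, version, **inputs):
        # The cache key covers the task identity, its version, and its inputs.
        payload = json.dumps(
            {"task": fn.__name__, "version": version, "inputs": inputs},
            sort_keys=True,
        )
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.executions += 1
            self.cache[key] = fn(**inputs)
        return self.cache[key]


def average_speed(distance_km: float, hours: float) -> float:
    return distance_km / hours


runner = CachingRunner()
first = runner.run(average_speed, "v1", distance_km=120.0, hours=2.0)
second = runner.run(average_speed, "v1", distance_km=120.0, hours=2.0)  # cache hit
third = runner.run(average_speed, "v2", distance_km=120.0, hours=2.0)   # new version, runs again
```

Including the version in the key means that changing a task invalidates its old results without clearing the cache by hand.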
Flytekit (Flyte SDK)
operatorframework.io
- Task execution and resource
isolation is managed by Kubernetes
- Throttling and queueing is handled
by Flyte propeller
- Multi tenant
- Allows an isolated
environment per task or project
- Supports workflow versioning
- Data aware, can cache task results
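The throttling and queueing point above can be illustrated with a bounded worker pool: at most a fixed number of tasks run at once and the rest wait in a queue. This sketch uses Python's standard thread pool and is not how Flyte propeller is implemented. `MAX_PARALLEL_TASKS` and `run_task` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_TASKS = 2   # hypothetical concurrency limit

stats = {"active": 0, "peak": 0}
lock = threading.Lock()


def run_task(task_id):
    """Pretend task that records how many tasks are running at the same time."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    try:
        return task_id * task_id   # stand-in for real work
    finally:
        with lock:
            stats["active"] -= 1


# Submissions beyond the limit wait in the executor's internal queue.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_TASKS) as pool:
    results = list(pool.map(run_task, range(6)))
```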
Overhead
- Ephemeral infrastructure
brings a startup time
overhead
- Teams need to maintain
their Docker images
Anti patterns
- Table sensing is done much
more elegantly in Airflow (use
an event-driven approach)
- Not suited for complex
parallel computation
Good tool if you need a multi tenant
environment, custom dependencies
per task or project, workflow versioning,
and compute isolation
Airflow
- Good for classic ETL jobs
- Quick and simple to start
- Good support for table sensing
- No multi tenancy
- No workflow versioning
- Monolithic, fixed set of workers
- No infrastructure and
environment isolation
Flyte
- Good for custom jobs (like ML)
- Multi tenant, API-friendly
- Supports workflow versioning
- Overhead with ephemeral
infrastructure and image
maintenance
- Infrastructure and environment
isolation based on Kubernetes
Choose the right tool for the right job*
(Some cases can be implemented well on both engines)
Flyte
# live demo
Please ask questions in the
chat!
The best question gets
something special from
our speakers
#PrimerosPasos
Raffle Time!
Now Hiring in Minsk and Kyiv!
Backend, Data, ML Engineers
And many more on our careers page! (lyft.com/careers)
Join the Ride!
Connect with us!
LYFT.COM/CAREERS

From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 

Orchestrating big data and ML pipelines at Lyft

  • 1. Orchestrating big data and ML pipelines at Lyft Lyft Talks #4
  • 3. Join the Ride! AND GET $5K SIGN-ON BONUS (Applicable for ENG openings in Minsk only. Must apply and/or get a job offer between Nov 18 and Jan 30 to qualify)
  • 5. Agenda: 1. Use-cases (big data, ML); 2. Big data at Lyft (scale, ecosystem); 3. Orchestration (Airflow, Flyte); 4. Flyte live demo, Q&A
  • 7. Keeping map data accurate and fresh. We have high-quality curated map data from various sources, including open data such as OpenStreetMap. We make the map more accurate by processing imagery and GPS telemetry to recognize objects on the road such as road signs, closures, and traffic cones. The impact is better route calculation, ETA, and trip cost estimation.
  • 8. Calculating and suggesting pickup spots. We have the history of previous pickups. We recommend the best nearby pickup options to users. The impact is a better user experience with less driver friction: pickups in optimal locations and more rides due to fewer cancellations.
  • 9. Detecting missing and inaccurate destinations. We have anomalies in ride. We detect when our data shows patterns suggesting inaccurate destinations. The impact is a better user experience: users can find their destinations effectively, which leads to more rides.
  • 10. Route calculation, ETA/price estimation. We have rich map data and telemetry from driver mobile devices. We generate real-time speed profiles and build a probabilistic model of routes. The impact is better routes, ETAs, and price estimates.
  • 11. Forecasting of traffic, demand, and supply. We have driver location data, information about events, and ride history. We forecast demand and supply to understand market balance. The impact is efficient pricing that serves more rides and more informed decisions about which incentives to give drivers (e.g. bonus zones).
  • 12. Big data at Lyft #ecosystem #scale
  • 14. infrastructure; compute engines; stream processing; development, reporting; orchestration, ETL; storage and metadata
  • 19. 50PB data in S3; 376K jobs last month; 5M containers last month; +400GB last month; ETL; 650K pipeline runs; 24M task executions; 8 773 789 938 145 analytical events last month
  • 22. Orchestration engines at Lyft: run pipelines (scheduled and ad hoc); provide a Python DSL; provide integrations with third-party systems (Hive, Presto, Spark, ...); are not compute engines; good for batch execution; not for data streaming
  • 24. What is the difference between Flyte and Airflow? Why was Flyte created? Why do we use both? Should I use Flyte or Airflow for my project?
  • 28. Quick and simple to start; many integrations (operators); good support for sensor tasks; monolithic; fixed set of workers; does not manage infrastructure
  • 29. No multi-tenancy: no environment separation (DEV, STAGE, PROD); impossible to set up custom libraries and dependencies per DAG. Limited functionality: no versioning of DAGs (no way to compare outputs of version A vs B); no caching of task results (Airflow is not data aware)
  • 30. Monolithic scheduler: the centralized scheduler becomes a bottleneck. No resource management: heavy tasks may overwhelm a worker; impossible to set per-task resource quotas such as max memory/CPU
  • 31. TARS: an Airflow development environment for testing and backfilling; a Kubernetes pod with ETL software, libraries, and CLI tools (TARS is also the robot from the movie Interstellar)
  • 32. Good tool for classic ETLs that use a standard set of operators and for orchestrating third-party systems, when custom environments and multi-tenancy are not required
  • 34. Nov 2016: Flyte V0 built for the ETA team at Lyft. Nov 2019: Flyte was open sourced at KubeCon (flyte.org). Q2 2020: Spotify and Freenome join Flyte as collaborators. Q1 2021: Flyte was donated to the LF AI & Data Foundation; Union.ai started. Q3 2021: 15 collaborator organizations, 100+ contributors; Spotify contributes flytekit-java.
  • 35. Multi-tenancy. The workspace is organized into projects. Projects or individual tasks can have different environments. Projects are organized into domains: development, staging, production.
  • 36. Flyte workflows. Language agnostic: can be written in Python and Java; any Docker image can be a task. Versioned: each version is a separate Docker image. Data aware: strong typing for inputs and outputs; Flyte executes tasks based on data dependencies; results can be cached.
  • 40. Task execution and resource isolation are managed by Kubernetes; throttling and queueing are handled by Flyte Propeller; multi-tenant; allows an isolated environment per task or project; supports workflow versioning; data aware, can cache task results
  • 41. Overhead: ephemeral infrastructure adds startup time; teams need to maintain their own Docker images. Anti-patterns: table sensing is done much more elegantly in Airflow (use an event-driven approach); not suited for complex parallel computation
  • 42. Good tool if you need a multi-tenant environment, custom dependencies per task or project, workflow versioning, and compute isolation
  • 43. Airflow: good for classic ETL jobs; quick and simple to start; good support for table sensing; no multi-tenancy; no workflow versioning; monolithic, fixed set of workers; no infrastructure and environment isolation. Flyte: good for custom jobs (like ML); multi-tenant, API-friendly; supports workflow versioning; overhead from ephemeral infrastructure and image maintenance; infrastructure and environment isolation based on Kubernetes
  • 44. Choose the right tool for the right job* (Some cases can be implemented well on both engines)
  • 46. Please ask questions in the chat! The best question gets something special from our speakers
  • 48. Now Hiring in Minsk and Kyiv! Backend, Data, ML Engineers And many more on our careers page! (lyft.com/careers) Join the Ride!
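
The Flyte slides describe workflows as versioned and data aware, with task results that can be cached. The sketch below illustrates that idea in plain Python. It is not Flyte's API or implementation: the names `cache_key`, `run_cached`, and `average_speed`, and the in-memory dictionary, are hypothetical and chosen only for illustration. It assumes a result can be reused when the task name, the version, and the inputs all match.

```python
import hashlib
import json
from typing import Any, Callable, Dict

# Hypothetical in-memory cache; a real engine would persist results elsewhere.
_CACHE: Dict[str, Any] = {}


def cache_key(task_name: str, version: str, inputs: Dict[str, Any]) -> str:
    """Build a deterministic key from the task identity and its inputs."""
    payload = json.dumps(
        {"task": task_name, "version": version, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(task: Callable[..., Any], version: str, **inputs: Any) -> Any:
    """Run `task` only if no result is stored for this (task, version, inputs)."""
    key = cache_key(task.__name__, version, inputs)
    if key not in _CACHE:
        _CACHE[key] = task(**inputs)
    return _CACHE[key]


# Counts real executions so cache hits are observable.
CALLS = {"count": 0}


def average_speed(distance_km: float, hours: float) -> float:
    """Toy task used only to demonstrate caching."""
    CALLS["count"] += 1
    return distance_km / hours


first = run_cached(average_speed, "v1", distance_km=120.0, hours=2.0)
second = run_cached(average_speed, "v1", distance_km=120.0, hours=2.0)  # same key, reused
third = run_cached(average_speed, "v2", distance_km=120.0, hours=2.0)  # new version, re-executed
```

Including the version in the key means a new workflow version never reuses results produced by an older one, which keeps outputs of version A and version B comparable.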

Editor's Notes

  1. Basemap Collection (imagery): Yury Kunitski, Lev Dragunov. Map delivery: Thom Dedecko. There is a lot of good detection imagery for closures here: https://docs.google.com/document/d/123rpa5nop5OtjLhq-YvUcF6V9jAYgqSu39ayYb7SQ0k/edit and some systems overview material, including an image of our fleetview camera, here: https://docs.google.com/presentation/d/11jWVR_EZ8Vgar2yBAwHmc3CwZLRbdmfjq20sycDFCno/edit#slide=id.g2d1a3323cd_0_58
  2. Journey Rendezvous Xiaomeng Chen
  3. Journey Data Andrey Kravtsov
  4. Mapping (Localization) Artsem Semianenka
  5. Market signals, forecasting Matthew Smith
  6. Market signals, forecasting Matthew Smith
  7. Market signals, forecasting Matthew Smith
  8. https://confluence.lyft.net/pages/viewpage.action?pageId=207326078
  9. https://blog.container-solutions.com/kubernetes-operators-explained ‘Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components. Operators follow Kubernetes principles, notably the control loop’. https://kubernetes.io/docs/concepts/extend-kubernetes/operator/