The Lyft data platform: Now and in the future

Now and The Future
Lyft Data Platform
Mark Grover | @mark_grover
Deepak Tiwari | @_deepaktiwari_
Improve people’s lives with the world’s best transportation
● 30.7M riders in 2018
● 1.9M drivers in 2018
● 1B+ cumulative rides
● 300+ markets in US & Canada
Data is at the core of decisions at Lyft
Automated decisions
- What’s the price for the ride?
- What driver to match?
- What’s the ETA?
Analyzing business performance
- How are key business metrics trending?
- How do predicted ETAs compare to actual?
Human business decisions
- Which opportunities to invest in?
- Which path to take (via experimentation)?
Data platform users
Users: Data Modelers, Analysts, Data Scientists, General Managers, Data Platform Engineers, Experimenters, PMs/Execs
Use cases: Analytics, Biz ops, Building apps, Experimentation
By numbers...
● Millions of BI queries per week, doubling quarterly
● 5X increase in productivity of ML models in 2018
● 20X scaling of support of maps to users through streaming platform in 2018
Data as a platform to accelerate the business and reduce risk...
● Product Teams, Applied ML, Forecasting
● ML Platform
● Data Platform and Infra
Source: The AI Hierarchy of Needs, Monica Rogati (8/2017)
Guiding principles for the data platform team...
Innovative, Impactful, Reliable
● Think ahead to the future (e.g. streaming, machine learning, security and privacy, visualization, discovery, etc.).
● Provide a step change (vs incremental) in the capability.
● Move fast.
● Create a competitive advantage.
● Focus on impact: Develop jointly with application verticals.
● Build an enterprise-grade platform.
● Have a clearly defined contract with applications (e.g. SLAs).
● Give a serverless application for the product teams.
Use case #1
Unmet need for business metric observability
Business metric observability
What’s the health of the business?
Grafana
Operational observability
What’s the health of the service?
● Is the service up?
● Is it throwing errors?
● In near real-time (< 1 min)
Requirements for biz metric observability
● Near real-time: See results within 1 - 30 minutes
● Be the source of truth: Don’t widen the gap for reconciliation
● Impact on business metrics: Derive business metrics from raw data (aka ETL)
Project F2 architecture
[Architecture diagram: Operational Data stores (e.g. Dynamo), CDC, Apache Superset, Data Discovery app - Amundsen; online flow and offline flow]
Magic of CDC - Change Data Capture
From Operational Data stores (e.g. Dynamo) to Analytical Data stores (e.g. Hive/Presto, BQ)
1. Tail the operational Data stores
2. Persist the raw change log
3. Upsert the change log to table periodically (~30 min)
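The three CDC steps above can be sketched in a few lines of Python. This is a minimal illustration, not Lyft's implementation: the record shape `(key, revision, row)`, the use of a per-key revision number, and the function names `upsert_change_log` and `live_rows` are all assumptions made for the example.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# Hypothetical change-log entry: (primary key, revision number, row).
# A row of None stands for a delete in the operational store.
Change = Tuple[str, int, Optional[Dict[str, Any]]]
# Hypothetical analytical table: key -> (latest revision seen, row or None).
Table = Dict[str, Tuple[int, Optional[Dict[str, Any]]]]


def upsert_change_log(table: Table, change_log: Iterable[Change]) -> Table:
    """Apply a batch of raw changes; the highest revision per key wins.

    Older or replayed revisions are ignored, so applying the same batch
    twice leaves the table unchanged.
    """
    merged = dict(table)
    for key, revision, row in change_log:
        current = merged.get(key)
        if current is None or revision > current[0]:
            merged[key] = (revision, row)
    return merged


def live_rows(table: Table) -> Dict[str, Dict[str, Any]]:
    """Return only the rows that have not been deleted."""
    return {key: row for key, (_, row) in table.items() if row is not None}


# Changes may arrive out of order within a batch.
batch = [
    ("ride-1", 1, {"status": "requested"}),
    ("ride-2", 1, {"status": "requested"}),
    ("ride-1", 3, {"status": "completed"}),
    ("ride-1", 2, {"status": "accepted"}),
    ("ride-2", 2, None),  # ride-2 was deleted upstream
]
snapshot = upsert_change_log({}, batch)
print(live_rows(snapshot))
```

Keeping the revision next to each row is what makes the periodic upsert safe to retry in this sketch.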
Advantages of CDC
● Near real-time: See results within 30 minutes
● Source of truth: No need to reconcile; same data as operational DBs
● Data Engineer Productivity: No need to recreate ETL from events; easier primitives to build ETL on top of
Challenges of the architecture
● Measuring reliability
○ How to distinguish late arriving data from missing data?
○ How do you trace a single missing revision through all moving parts?
● Lots of moving parts
○ Tailer, tied to implementation of operational DB
○ Ingest pipeline
○ Kafka, Kinesis
○ Analytic Database
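One possible way to approach the late-versus-missing question is to look for gaps in per-key revision numbers and apply a grace period. The sketch below assumes revisions are contiguous integers and that the ingest time of each revision is recorded; both are assumptions for illustration, as are the name `classify_gaps` and the grace-period parameter.

```python
from typing import Dict


def classify_gaps(observed: Dict[int, float], now: float, grace_s: float) -> Dict[int, str]:
    """Label each revision that is absent between two observed revisions.

    `observed` maps revision number -> ingest time in seconds. A gap is
    "late" while the newer neighbouring revision arrived within the grace
    period, and "missing" once that period has passed.
    """
    status: Dict[int, str] = {}
    revisions = sorted(observed)
    for lower, upper in zip(revisions, revisions[1:]):
        waited = now - observed[upper]
        for revision in range(lower + 1, upper):
            status[revision] = "late" if waited <= grace_s else "missing"
    return status


# Revisions 3, 5 and 6 were never ingested for this key.
seen = {1: 0.0, 2: 10.0, 4: 100.0, 7: 1000.0}
print(classify_gaps(seen, now=1100.0, grace_s=300.0))
```

A revision flagged "missing" would then be the starting point for tracing it through the tailer, the ingest pipeline, and the stream.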
CDC + Streaming = Lots of business value
Use case #2
Data Science use cases - Driver app
Data Science use cases - Pricing
Requirements for streaming applications
In Streaming, just like in Batch
Quick and simple ways of cleaning data
Prototype in a language of choice (Python, R, SQL)
Streaming architecture
[Architecture diagram: Services (e.g. ETA, Pricing), Models + Applications (e.g. ETA, Pricing), Flyte]
Investments in Streaming
Dryft
Fully managed data processing engine, powering real-time features and events
- Needed for consistent feature generation
- Batch processing for bulk creation of features for training
- Stream processing for real-time creation of features for scoring
- Uses Flink SQL under the hood
Apache Beam
Open source unified, portable and extensible model for both batch and streaming use-cases
- Enables streaming use cases for teams using non JVM languages
- Uses Flink under the hood
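The point about consistent feature generation can be illustrated with a small sketch in which one feature definition serves both the batch (training) path and the streaming (scoring) path. This is plain Python rather than Flink SQL or Beam, and it is not how Dryft is built; the event shape, the trailing-window mean feature, and all function names are hypothetical. Events are assumed to arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Tuple

Event = Tuple[float, float]  # (timestamp in seconds, value)


def window_mean(events: Iterable[Event], now: float, window_s: float) -> float:
    """The single feature definition: mean value over the trailing window."""
    values = [value for ts, value in events if now - window_s < ts <= now]
    return sum(values) / len(values) if values else 0.0


def batch_features(events: List[Event], window_s: float) -> List[float]:
    """Training path: recompute the feature at every historical event time."""
    return [
        window_mean(events[: i + 1], ts, window_s)
        for i, (ts, _) in enumerate(events)
    ]


def stream_features(events: Iterable[Event], window_s: float) -> Iterator[float]:
    """Scoring path: keep only the live window and emit one feature per event."""
    buffer: Deque[Event] = deque()
    for ts, value in events:
        buffer.append((ts, value))
        # Evict events that have fallen out of the trailing window.
        while buffer and buffer[0][0] <= ts - window_s:
            buffer.popleft()
        yield window_mean(buffer, ts, window_s)


history = [(0.0, 10.0), (30.0, 20.0), (70.0, 30.0), (200.0, 40.0)]
offline = batch_features(history, window_s=60.0)
online = list(stream_features(iter(history), window_s=60.0))
print(offline, online)
```

Because both paths call the same `window_mean`, the values a model is trained on match the values it is scored on.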
Challenges of the architecture
● Things we find at scale
○ Intermittent AWS service errors
○ Can’t be naive about pub-sub consumption
● Integration
○ Things work in isolation, but …
○ Flink Kinesis Connector
■ Connectors that work at scale are hard
Sharing your batch and streaming compute will pay huge benefits
The whole shebang
Data Platform architecture
[Architecture diagram: Services (e.g. ETA, Pricing), Operational Data stores (e.g. Dynamo), Models + Applications (e.g. ETA, Pricing), Flyte, Data Discovery app - Amundsen, Apache Superset (BI/Data Viz), custom apps (Marketplace Operations app, other custom apps)]
Kafka vs. Kinesis
Kinesis scaling limitations
• We require high throughput & high fan-out
• Default limit of 500 shards
• Resharding is expensive and slow
• Built a fan-out system to work around limitations
Kafka is better but ….
• Has limitations around fan-in
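The slides do not describe how the fan-out system works. As a rough sketch of the general idea only, the example below reads each upstream record once and copies it to one bounded queue per downstream consumer, so adding consumers does not add readers on the stream. The class name `FanOut`, its methods, and the consumer names are hypothetical, and no Kinesis API is used.

```python
import queue
from typing import Dict, Iterable, List


class FanOut:
    """Copy every upstream record to a bounded queue per consumer."""

    def __init__(self, consumer_names: Iterable[str], max_buffered: int = 1000) -> None:
        self.queues: Dict[str, "queue.Queue[bytes]"] = {
            name: queue.Queue(maxsize=max_buffered) for name in consumer_names
        }

    def publish(self, records: Iterable[bytes]) -> int:
        """Read each record once and hand a copy to every consumer."""
        count = 0
        for record in records:
            for consumer_queue in self.queues.values():
                # put() blocks when a queue is full, which applies
                # backpressure if a consumer falls behind.
                consumer_queue.put(record)
            count += 1
        return count

    def drain(self, name: str) -> List[bytes]:
        """Return everything currently buffered for one consumer."""
        consumer_queue = self.queues[name]
        drained: List[bytes] = []
        while not consumer_queue.empty():
            drained.append(consumer_queue.get_nowait())
        return drained


fan_out = FanOut(["pricing", "eta"])
published = fan_out.publish([b"event-a", b"event-b"])
pricing_records = fan_out.drain("pricing")
eta_records = fan_out.drain("eta")
print(published, pricing_records, eta_records)
```

In a real deployment the queues would be replaced by network delivery and the single reader by one reader per shard; those details are outside what the slides state.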
Streaming engines
● Apache Flink vs. Apache Spark vs. Apache Beam
● 2 dimensions of comparison
○ APIs (the kinds of applications you can write)
○ Operations (the kind of applications you can support)
● Apache Beam for multi-language support (Python and Go)
● Spark Streaming - operations were hard, no state evolution, cumulative latencies with multi-stage graphs.
● Know when to put all your eggs in the same basket (Spark), when not to.
Data Storage and processing
Interactive querying:
● Redshift
○ Historical but dying
● Druid
○ Interactive use-cases
● Presto (on S3)
○ Super handy interactive query engine
○ Lacking real-time ingestion support
● BigQuery
○ Interactive query engine (like Presto)
○ Expensive, but great streaming support!
ETL:
● Hive (on S3)
○ Mostly for ETL and ad hoc queries that are too large to run on Presto
● Spark
○ Some ETL, potential for all ETL to be in Spark
Future of Interactive querying
Unified access layer, e.g. DAL, Genie, DALi Views
Future of ETL
- Easily schedule a SQL query, with its dependencies, as an ETL job
- Diagnose job failures with lineage and dashboards on data skew, etc.
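The idea of scheduling a SQL query with dependencies as an ETL job can be sketched as follows. The job names, table names, and SQL text are made up for illustration, and the ordering uses Python's standard `graphlib` module (Python 3.9+) rather than any scheduler named in the slides.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Hypothetical job definitions: each ETL job is one SQL query plus the
# names of the jobs whose output tables it reads.
JOBS: Dict[str, Dict[str, object]] = {
    "weekly_metrics": {
        "sql": "SELECT week, SUM(rides) AS rides FROM daily_rides GROUP BY week",
        "depends_on": ["daily_rides"],
    },
    "daily_rides": {
        "sql": "SELECT ds, COUNT(*) AS rides FROM raw_rides GROUP BY ds",
        "depends_on": ["raw_rides"],
    },
    "raw_rides": {
        "sql": "SELECT * FROM cdc_rides_snapshot",
        "depends_on": [],
    },
}


def run_order(jobs: Dict[str, Dict[str, object]]) -> List[str]:
    """Return job names so that every job comes after its dependencies.

    TopologicalSorter raises graphlib.CycleError if the dependencies loop.
    """
    graph = {name: set(spec["depends_on"]) for name, spec in jobs.items()}
    return list(TopologicalSorter(graph).static_order())


order = run_order(JOBS)
for name in order:
    # A real scheduler would submit JOBS[name]["sql"] to the query engine here.
    print(name)
```

The same dependency graph is also a simple form of lineage: when a job fails, its upstream jobs are the first place to look.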
Workflow engines
● Airflow
○ Most ETL jobs
○ Python heavy DAGs
○ Really good community to support
● Flyte
○ Focussed on ML workflows
○ Built-in provenance
○ Intermediate caching, discovery of previously computed artifacts
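Intermediate caching and discovery of previously computed artifacts can be illustrated with a small memoization sketch. It is not Flyte's API: the class `ArtifactCache`, the cache key built from task name, version, and inputs, and the in-memory store are all assumptions made for the example, and inputs are assumed to be JSON-serializable.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class ArtifactCache:
    """Reuse a task's output when the same task version sees the same inputs."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0

    @staticmethod
    def key(task_name: str, version: str, inputs: Dict[str, Any]) -> str:
        """Derive a stable key from what was run and what it was run on."""
        payload = json.dumps([task_name, version, inputs], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(self, task: Callable[..., Any], version: str, **inputs: Any) -> Any:
        cache_key = self.key(task.__name__, version, inputs)
        if cache_key in self._store:
            self.hits += 1  # artifact discovered, skip recomputation
            return self._store[cache_key]
        result = task(**inputs)
        self._store[cache_key] = result
        return result


executions = []


def build_features(day: str) -> str:
    """Hypothetical expensive step; records each real execution."""
    executions.append(day)
    return f"features-{day}"


cache = ArtifactCache()
first = cache.run(build_features, "v1", day="2019-05-01")
second = cache.run(build_features, "v1", day="2019-05-01")  # served from cache
third = cache.run(build_features, "v2", day="2019-05-01")   # new version, recomputed
print(first, second, third, cache.hits, len(executions))
```

Including the version in the key is what lets a changed task invalidate old artifacts without deleting them.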
Conclusion
● We think about data as a platform and a competitive advantage.
● Our data and platform usage is growing really really fast.
● We support Data Science, Ops, Analytics, Experimentation and other use cases.
● We have seen tremendous benefit from CDC data + Streaming frameworks to deliver business metric observability.
● We have learned and gained a lot in operational excellence by sharing our batch and stream compute frameworks.
● We are investing in Data Discovery, Streaming, and Machine Learning.
Attend Streaming at Lyft session tomorrow!
Attend Meetup at Level39 tonight!
Thank you
go.lyft.com/lyftdataplatform
May 2nd, 2019
Mark Grover | @mark_grover
Deepak Tiwari | @_deepaktiwari_