Streaming at Lyft - Greg Fee - Seattle Apache Flink Meetup

Seattle Apache Flink Meetup
Streaming@Lyft
Flink Meetup June 2018
Gregory Fee
About Me
● Engineer @ Lyft
● Teams - ETA, Data Science Platform, Data Platform
● Accomplishments
○ ETA model training from 4 months to every 10 minutes
○ Real-time traffic updates
○ Flyte - Large Scale Orchestration and Batch Compute
○ Lyftlearn - Custom Machine Learning Library
○ Dryft - Real-time Feature Generation for Machine Learning
Streaming Use Cases
● Firehose -> S3 for Hive and Presto
● Real-time Traffic
● Primetime + Heatmaps
● Fraud Detection
● Ride Receipts
● Driver incentives
● Passenger coupons
● Anomaly Detection
● Map Correction
● Location Smoothing
● Bad Experience Detection
● Accident Detection
● ...and so much more
A Brief History of Lyft
● <2015 - monolithic PHP app, Redshift
● 2015 - Python services, “workers”, Fanner, Spark, Job Scheduler
● 2016 - Go services, Hive
● 2017 - Presto
● 2018 - Druid, Kafka, Flink, Beam
Streaming Architecture Overview
[Diagram: Mobile, Services, Ingest/Enrich, Fanner, KCL, Job Scheduler, S3]
Fanner
● Pub-sub layer on Kinesis
● Curates output streams based on event type and simple value filters
● Pros - easy to use, integrated with JobScheduler
● Cons - no ordering guarantees, limited scaling
[Diagram: Kinesis (All Events) -> Fanner -> Kinesis (Event A), Kinesis (Event B), Kinesis (Event A + B)]
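To make the fan-out idea concrete, here is a minimal sketch of routing one input stream into several curated output streams. It is an illustration only: the stream names, the "type" field, and the in-memory lists standing in for Kinesis streams are all assumptions, not Lyft's actual implementation.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]

# Hypothetical routing rules: output stream name -> predicate over an event.
# The stream names and the "type" field are illustrative assumptions.
RULES: Dict[str, Callable[[Event], bool]] = {
    "event_a": lambda e: e["type"] == "A",
    "event_b": lambda e: e["type"] == "B",
    "event_a_plus_b": lambda e: e["type"] in {"A", "B"},
}


def fan_out(events: List[Event]) -> Dict[str, List[Event]]:
    """Copy each event onto every output stream whose rule matches it."""
    outputs: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        for stream, matches in RULES.items():
            if matches(event):
                outputs[stream].append(event)
    return dict(outputs)


# In-memory lists stand in for the "All Events" input and the output streams.
all_events = [{"type": "A", "id": 1}, {"type": "B", "id": 2}, {"type": "C", "id": 3}]
streams = fan_out(all_events)
print({name: [e["id"] for e in evts] for name, evts in streams.items()})
```

An event that matches no rule (type "C" here) is simply not forwarded, which is one way to read "curates output streams".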
Job Scheduler
● Register data to call webhook
● Fire and forget
● Immediate or scheduled
● Pros - easy asynchronous programming, scales well
● Cons - no ordering guarantees, suboptimal outage handling, no replay
[Diagram: Service -> SQS -> Workers -> Target Service]
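The register-then-deliver pattern described on this slide could be sketched roughly as follows. Everything here is hypothetical: the class and method names are invented for illustration, an in-memory heap stands in for SQS, and a plain callback stands in for the webhook call to the target service.

```python
import heapq
import itertools
from typing import Any, Callable, Dict, List, Tuple

Payload = Dict[str, Any]


class JobScheduler:
    """In-memory stand-in for the queue; all names here are hypothetical."""

    def __init__(self) -> None:
        self._queue: List[Tuple[float, int, str, Payload]] = []
        self._seq = itertools.count()  # tie-breaker so payloads are never compared

    def register(self, webhook: str, payload: Payload, run_at: float = 0.0) -> None:
        """Fire and forget: enqueue and return; no result is reported back."""
        heapq.heappush(self._queue, (run_at, next(self._seq), webhook, payload))

    def run_due(self, now: float, call: Callable[[str, Payload], None]) -> int:
        """Worker step: deliver every job whose scheduled time has arrived."""
        delivered = 0
        while self._queue and self._queue[0][0] <= now:
            _, _, webhook, payload = heapq.heappop(self._queue)
            call(webhook, payload)
            delivered += 1
        return delivered


calls: List[Tuple[str, Payload]] = []
scheduler = JobScheduler()
scheduler.register("/receipts", {"ride_id": 1})              # immediate
scheduler.register("/coupons", {"user_id": 7}, run_at=60.0)  # scheduled
first = scheduler.run_due(now=0.0, call=lambda url, data: calls.append((url, data)))
second = scheduler.run_due(now=60.0, call=lambda url, data: calls.append((url, data)))
```

Note that this single-process sketch happens to deliver jobs in time order; the slide lists "no ordering guarantees" as a drawback of the real system, so callers should not rely on ordering.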
“Workers”
● Simple asynchronous/stream programming
● KCL worker to grab data from Kinesis, transform, store in Redis
● Additional workers read from Redis, transform, store in Redis
● Pros - familiar programming paradigm
● Cons - failure handling, checkpointing, joins, replay, etc. are roll-your-own (i.e. error-prone)
Next Generation Goals
● Build Community
● Lower support burden
● Enhance developer ergonomics
● Scale with the business
● Support additional use cases
Next Generation Overview
● Pub-Sub -> Kafka
○ Large community, mature technology
● Stream Processing -> Flink
○ Growing community, event time processing
● Apache Beam for cross-language support
● SPaaS
○ Kubernetes
○ Dryft
Dryft
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - Flink SQL
○ Use Flink as the processing engine using streaming or bulk data
○ Add automation to make it super simple to launch and maintain feature generation programs at scale
Dryft Program
{
  "source": "dryft",
  "query_file": "decl_ride_completed.sql",
  "kinesis": {
    "stream": "declridecompleted"
  },
  "features": {
    "n_total_rides": {
      "description": "All time ride count per user",
      "type": "int",
      "version": 1
    }
  }
}
SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id,
       COUNT(ride_id) AS n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
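To see what the query above computes, the following sketch runs it against a few made-up rows. SQLite is used here purely as a convenient local stand-in for Flink SQL, and the table schema and sample data are assumptions; only the query text comes from the slide.

```python
import sqlite3

# Query text as shown on the slide.
QUERY = """
SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id,
       COUNT(ride_id) AS n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
"""

conn = sqlite3.connect(":memory:")
# Assumed schema: three nullable id columns plus a ride id.
conn.execute(
    "CREATE TABLE event_ride_completed ("
    "user_lyft_id INTEGER, passenger_lyft_id INTEGER, "
    "passenger_id INTEGER, ride_id TEXT)"
)
# Hypothetical sample events; each row fills a different id column.
conn.executemany(
    "INSERT INTO event_ride_completed VALUES (?, ?, ?, ?)",
    [
        (10, None, None, "r1"),    # identified by user_lyft_id
        (None, 10, None, "r2"),    # same user, identified by passenger_lyft_id
        (None, None, 20, "r3"),    # identified by passenger_id
        (None, None, None, "r4"),  # no id at all, falls back to -1
    ],
)
result = sorted(conn.execute(QUERY).fetchall())
conn.close()
print(result)
```

The COALESCE picks the first non-null id, so rides r1 and r2 are counted under the same user even though they arrived with different id fields, and rides with no id are grouped under -1.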
Dryft Program Execution
● Backfill - read historic data from S3, process, sink to S3
● Real-time - read stream data from Kinesis/Kafka, process, sink to DynamoDB
[Diagram: S3 Source -> SQL -> Sink]
[Diagram: Kinesis/Kafka Source -> SQL -> Sink]
Bootstrapping
● Read historic data from S3
● Transition to reading real-time data
● https://data-artisans.com/flink-forward/resources/bootstrapping-state-in-apache-flink
[Diagram: S3 Source (< Target Time) and Kinesis/Kafka Source (>= Target Time) -> Business Logic -> Sink]
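The handoff at the target time can be sketched as below. This is a simplified illustration, not the mechanism from the linked talk: the function name, the "ts" field, and the in-memory lists standing in for the S3 and Kinesis/Kafka sources are all assumptions, and the two sources are read one after the other for simplicity.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, int]


def bootstrap(
    historic: Iterable[Event], realtime: Iterable[Event], target_time: int
) -> Iterator[Event]:
    """Yield historic events before target_time, then real-time events from it on."""
    for event in historic:
        if event["ts"] < target_time:   # historic source covers < Target Time
            yield event
    for event in realtime:
        if event["ts"] >= target_time:  # real-time source covers >= Target Time
            yield event


# Hypothetical data: the two sources overlap around the target time.
s3_events: List[Event] = [{"ts": 1}, {"ts": 2}, {"ts": 3}, {"ts": 4}]
stream_events: List[Event] = [{"ts": 3}, {"ts": 4}, {"ts": 5}]
merged = list(bootstrap(s3_events, stream_events, target_time=3))
print([e["ts"] for e in merged])
```

Because the two filters are complementary, an event that appears in both sources reaches the business logic exactly once.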
Continuous Window Semantics
● Continuous window counts go up and down
● SELECT user_id, COUNT(ride_id) OVER (PARTITION BY user_id ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING) FROM event_ride_completed
● Example: rides 1p, 1:30p, 2:15p, 3:45p
● Flink default is one message in, one message out
○ Output = 1@1p, 2@1:30p, 2@2:15p, 1@3:45p
● Dryft is one message in, two messages out
○ Output = 1@1p, 2@1:30p, 1@2p, 2@2:15p, 1@2:30p, 0@3:15p, 1@3:45p, 0@4:45p
● Other fancy SQL analysis tricks too
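The two output sequences above can be reproduced with a small simulation. This is a plain-Python sketch of the semantics only, not how Flink or Dryft implement them; the function names are invented, times are minutes since midnight, and the formatter only handles afternoon times like those in the example.

```python
from typing import List, Tuple

WINDOW_MINUTES = 60


def fmt(minute: int) -> str:
    """Render minutes since midnight in the slide's notation, e.g. 810 -> '1:30p'."""
    hour, rest = divmod(minute, 60)
    hour12 = hour - 12 if hour > 12 else hour
    return f"{hour12}p" if rest == 0 else f"{hour12}:{rest:02d}p"


def on_arrival_only(rides: List[int]) -> List[Tuple[int, int]]:
    """One message in, one message out: emit the trailing-hour count at each ride."""
    return [
        (sum(1 for r in rides if t - WINDOW_MINUTES < r <= t), t)
        for t in rides
    ]


def with_expirations(rides: List[int]) -> List[Tuple[int, int]]:
    """One message in, two out: also emit a count when each ride leaves the window."""
    # +1 when a ride arrives, -1 one window length later when it expires.
    changes = sorted(
        [(t, +1) for t in rides] + [(t + WINDOW_MINUTES, -1) for t in rides]
    )
    count, output = 0, []
    for minute, delta in changes:
        count += delta
        output.append((count, minute))
    return output


def render(output: List[Tuple[int, int]]) -> str:
    return ", ".join(f"{count}@{fmt(minute)}" for count, minute in output)


rides = [13 * 60, 13 * 60 + 30, 14 * 60 + 15, 15 * 60 + 45]  # 1p, 1:30p, 2:15p, 3:45p
default_output = render(on_arrival_only(rides))
dryft_output = render(with_expirations(rides))
print(default_output)
print(dryft_output)
```

The second sequence is what lets a downstream feature store see the count fall back to 0 even when no new ride arrives.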
Beyond Feature Generation
● Reactive Programming
○ When an event occurs, execute some logic
● Asynchronous Programming
○ Perform more processing asynchronously
● Geotriggering
○ Emit an event when someone enters or leaves an area
● Change Data Capture
○ Mirror production data in analytical data stores
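As one illustration of the geotriggering item, the sketch below emits an event whenever a user's position crosses the boundary of an area. The rectangular area, its coordinates, and the function names are all hypothetical, and a dict stands in for the per-user state a stream processor would keep.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical rectangular area: (min_lat, min_lng, max_lat, max_lng).
AREA = (47.60, -122.35, 47.62, -122.32)


def inside(lat: float, lng: float) -> bool:
    """True when the point lies within the rectangular area."""
    min_lat, min_lng, max_lat, max_lng = AREA
    return min_lat <= lat <= max_lat and min_lng <= lng <= max_lng


def geotrigger(locations: Iterable[Tuple[str, float, float]]) -> List[Tuple[str, str]]:
    """Emit (user, 'enter') or (user, 'leave') when a user's inside state flips."""
    was_inside: Dict[str, bool] = {}  # per-user state; users start as outside
    events: List[Tuple[str, str]] = []
    for user, lat, lng in locations:
        now_inside = inside(lat, lng)
        if now_inside != was_inside.get(user, False):
            events.append((user, "enter" if now_inside else "leave"))
        was_inside[user] = now_inside
    return events


# Made-up location updates for one user: outside, inside, inside, outside.
updates = [
    ("u1", 47.59, -122.33),
    ("u1", 47.61, -122.33),
    ("u1", 47.61, -122.34),
    ("u1", 47.63, -122.33),
]
triggers = geotrigger(updates)
print(triggers)
```

Only state changes produce output, so repeated updates from inside the area do not generate duplicate "enter" events.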
The End of the Stream
● Real-time Visualization w/ Druid and Superset
● Presto for <10s over large data sets
● Hive and Spark for extreme data sets
● Flyte - extreme data set workflows
Q&A