Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Streaming@Lyft
Flink Meetup June 2018
Gregory Fee
About Me
● Engineer @ Lyft
● Teams - ETA, Data Science Platform, Data Platform
● Accomplishments
○ ETA model training from...
Streaming Use Cases
● Firehose -> S3 for Hive and
Presto
● Real-time Traffic
● Primetime + Heatmaps
● Fraud Detection
● Ri...
A Brief History of Lyft
● <2015 - monolithic PHP app, Redshift
● 2015 - Python services, “workers”, Fanner,
Spark, Job Sch...
Streaming Architecture Overview
Mobile
Services
Ingest/
Enrich
Fanner
KCL
Job
Scheduler
S3
Fanner
● Pub-sub layer on Kinesis
● Curates output streams based on event type and simple value
filters
● Pros - easy to u...
Job Scheduler
● Register data to call webhook
● Fire and forget
● Immediate or scheduled
● Pros - easy asynchronous
progra...
“Workers”
● Simple asynchronous/stream programming
● KCL worker to grab data from Kinesis,
transform, store in Redis
● Add...
Next Generation Goals
● Build Community
● Lower support burden
● Enhance developer ergonomics
● Scale with the business
● ...
Next Generation Overview
● Pub-Sub -> Kafka
○ Large community, mature technology
● Stream Processing -> Flink
○ Growing co...
Dryft
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Su...
Dryft Program
{
"source": "dryft",
"query_file": "decl_ride_completed.sql",
"kinesis": {
"stream": "declridecompleted" },
...
Dryft Program Execution
● Backfill - read historic data from S3, process, sink to S3
● Real-time - read stream data from K...
Bootstrapping
● Read historic data from S3
● Transition to reading real-time data
● https://data-artisans.com/flink-forwar...
Continuous Window Semantics
● Continuous window counts go up and down
● SELECT user_id, count(ride_id) OVER (PARTITION BY
...
Beyond Feature Generation
● Reactive Programming
○ When an event occurs, execute some logic
● Asynchronous Programming
○ P...
The End of the Stream
● Real-time Visualization w/ Druid and
Superset
● Presto for <10s over large data sets
● Hive and Sp...
Q&A
Upcoming SlideShare
Loading in …5
×

Streaming at Lyft - Greg Fee - Seattle Apache Flink Meetup

1,449 views

Published on

Streaming at Lyft - Greg Fee - Seattle Apache Flink Meetup 06/28/2018

Published in: Engineering
  • Be the first to comment

Streaming at Lyft - Greg Fee - Seattle Apache Flink Meetup

  1. 1. Streaming@Lyft Flink Meetup June 2018 Gregory Fee
  2. 2. About Me ● Engineer @ Lyft ● Teams - ETA, Data Science Platform, Data Platform ● Accomplishments ○ ETA model training from 4 months to every 10 minutes ○ Real-time traffic updates ○ Flyte - Large Scale Orchestration and Batch Compute ○ Lyftlearn - Custom Machine Learning Library ○ Dryft - Real-time Feature Generation for Machine Learning
  3. 3. Streaming Use Cases ● Firehose -> S3 for Hive and Presto ● Real-time Traffic ● Primetime + Heatmaps ● Fraud Detection ● Ride Receipts ● Driver incentives ● Passenger coupons ● Anomaly Detection ● Map Correction ● Location Smoothing ● Bad Experience Detection ● Accident Detection ● ...and so much more
  4. 4. A Brief History of Lyft ● <2015 - monolithic PHP app, Redshift ● 2015 - Python services, “workers”, Fanner, Spark, Job Scheduler ● 2016 - Go services, Hive ● 2017 - Presto ● 2018 - Druid, Kafka, Flink, Beam
  5. 5. Streaming Architecture Overview Mobile Services Ingest/ Enrich Fanner KCL Job Scheduler S3
  6. 6. Fanner ● Pub-sub layer on Kinesis ● Curates output streams based on event type and simple value filters ● Pros - easy to use, integrated with JobScheduler ● Cons - no ordering guarantees, limited scaling Kinesis All Events Fanner Kinesis Event A Kinesis Event B Kinesis Event A + B
  7. 7. Job Scheduler ● Register data to call webhook ● Fire and forget ● Immediate or scheduled ● Pros - easy asynchronous programming, scales well ● Cons - no ordering guarantees, suboptimal outage handling, no replay Service SQS Workers Target Service
  8. 8. “Workers” ● Simple asynchronous/stream programming ● KCL worker to grab data from Kinesis, transform, store in Redis ● Additional workers read from Redis, transform, store in Redis ● Pros - familiar programming paradigm ● Cons - failure handling, checkpointing, joins, replay, etc. are roll your own (aka error prone)
  9. 9. Next Generation Goals ● Build Community ● Lower support burden ● Enhance developer ergonomics ● Scale with the business ● Support additional use cases
  10. 10. Next Generation Overview ● Pub-Sub -> Kafka ○ Large community, mature technology ● Stream Processing -> Flink ○ Growing community, event time processing ● Apache Beam for cross language support ● SPaaS ○ Kubernetes ○ Dryft
  11. 11. Dryft ● Need - Consistent Feature Generation ○ The value of your machine learning results is only as good as the data ○ Subtle changes to how a feature value is generated can significantly impact results ● Solution - Unify feature generation ○ Batch processing for bulk creation of features for training ML models ○ Stream processing for real-time creation of features for scoring ML models ● How - Flink SQL ○ Use Flink as the processing engine using streaming or bulk data ○ Add automation to make it super simple to launch and maintain feature generation programs at scale
  12. 12. Dryft Program { "source": "dryft", "query_file": "decl_ride_completed.sql", "kinesis": { "stream": "declridecompleted" }, "features": { "n_total_rides": { "description": "All time ride count per user", "type": "int", "version": 1 } } } SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id, COUNT(ride_id) as n_total_rides FROM event_ride_completed GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
  13. 13. Dryft Program Execution ● Backfill - read historic data from S3, process, sink to S3 ● Real-time - read stream data from Kinesis/Kafka, process, sink to DynamoDB SinkS3 Source SQL SinkKinesis/Kafka Source SQL
  14. 14. Bootstrapping ● Read historic data from S3 ● Transition to reading real-time data ● https://data-artisans.com/flink-forward/resources/bootstrappin g-state-in-apache-flink S3 Source Kinesis/Kafka Source Business Logic Sink < Target Time >= Target Time
  15. 15. Continuous Window Semantics ● Continuous window counts go up and down ● SELECT user_id, count(ride_id) OVER (PARTITION BY user_id ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING) from event_ride_completed ● Example: rides 1p, 1:30p, 2:15p, 3:45p ● Flink default is one message in, one message out ○ Output = 1@1p, 2@1:30p, 2@2:15p, 1@3:45p ● Dryft is one message in, two messages out ○ Output = 1@1p, 2@1:30p, 1@2p, 2@2:15p, 1@2:30p, 0@3:15p, 1@3:45p, 0@4:45p ● Other fancy SQL analysis tricks too
  16. 16. Beyond Feature Generation ● Reactive Programming ○ When an event occurs, execute some logic ● Asynchronous Programming ○ Perform more processing asynchronously ● Geotriggering ○ Emit an event when someone enters or leaves an area ● Change Data Capture ○ Mirror production data in analytical data stores
  17. 17. The End of the Stream ● Real-time Visualization w/ Druid and Superset ● Presto for <10s over large data sets ● Hive and Spark for extreme data sets ● Flyte - extreme data set workflows
  18. 18. Q&A

×