Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Streaming at Lyft - Greg Fee - Seattle Apache Flink Meetup

622 views

Published on

Streaming at Lyft - Greg Fee - Seattle Apache Flink Meetup 06/28/2018

Published in: Engineering
  • Be the first to comment

  • Be the first to like this

Streaming at Lyft - Greg Fee - Seattle Apache Flink Meetup

  1. 1. Streaming@Lyft Flink Meetup June 2018 Gregory Fee
  2. 2. About Me ● Engineer @ Lyft ● Teams - ETA, Data Science Platform, Data Platform ● Accomplishments ○ ETA model training from 4 months to every 10 minutes ○ Real-time traffic updates ○ Flyte - Large Scale Orchestration and Batch Compute ○ Lyftlearn - Custom Machine Learning Library ○ Dryft - Real-time Feature Generation for Machine Learning
  3. 3. Streaming Use Cases ● Firehose -> S3 for Hive and Presto ● Real-time Traffic ● Primetime + Heatmaps ● Fraud Detection ● Ride Receipts ● Driver incentives ● Passenger coupons ● Anomaly Detection ● Map Correction ● Location Smoothing ● Bad Experience Detection ● Accident Detection ● ...and so much more
  4. 4. A Brief History of Lyft ● <2015 - monolithic PHP app, Redshift ● 2015 - Python services, “workers”, Fanner, Spark, Job Scheduler ● 2016 - Go services, Hive ● 2017 - Presto ● 2018 - Druid, Kafka, Flink, Beam
  5. 5. Streaming Architecture Overview Mobile Services Ingest/ Enrich Fanner KCL Job Scheduler S3
  6. 6. Fanner ● Pub-sub layer on Kinesis ● Curates output streams based on event type and simple value filters ● Pros - easy to use, integrated with JobScheduler ● Cons - no ordering guarantees, limited scaling Kinesis All Events Fanner Kinesis Event A Kinesis Event B Kinesis Event A + B
  7. 7. Job Scheduler ● Register data to call webhook ● Fire and forget ● Immediate or scheduled ● Pros - easy asynchronous programming, scales well ● Cons - no ordering guarantees, suboptimal outage handling, no replay Service SQS Workers Target Service
  8. 8. “Workers” ● Simple asynchronous/stream programming ● KCL worker to grab data from Kinesis, transform, store in Redis ● Additional workers read from Redis, transform, store in Redis ● Pros - familiar programming paradigm ● Cons - failure handling, checkpointing, joins, replay, etc. are roll your own (aka error prone)
  9. 9. Next Generation Goals ● Build Community ● Lower support burden ● Enhance developer ergonomics ● Scale with the business ● Support additional use cases
  10. 10. Next Generation Overview ● Pub-Sub -> Kafka ○ Large community, mature technology ● Stream Processing -> Flink ○ Growing community, event time processing ● Apache Beam for cross language support ● SPaaS ○ Kubernetes ○ Dryft
  11. 11. Dryft ● Need - Consistent Feature Generation ○ The value of your machine learning results is only as good as the data ○ Subtle changes to how a feature value is generated can significantly impact results ● Solution - Unify feature generation ○ Batch processing for bulk creation of features for training ML models ○ Stream processing for real-time creation of features for scoring ML models ● How - Flink SQL ○ Use Flink as the processing engine using streaming or bulk data ○ Add automation to make it super simple to launch and maintain feature generation programs at scale
  12. 12. Dryft Program { "source": "dryft", "query_file": "decl_ride_completed.sql", "kinesis": { "stream": "declridecompleted" }, "features": { "n_total_rides": { "description": "All time ride count per user", "type": "int", "version": 1 } } } SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id, COUNT(ride_id) as n_total_rides FROM event_ride_completed GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
  13. 13. Dryft Program Execution ● Backfill - read historic data from S3, process, sink to S3 ● Real-time - read stream data from Kinesis/Kafka, process, sink to DynamoDB SinkS3 Source SQL SinkKinesis/Kafka Source SQL
  14. 14. Bootstrapping ● Read historic data from S3 ● Transition to reading real-time data ● https://data-artisans.com/flink-forward/resources/bootstrappin g-state-in-apache-flink S3 Source Kinesis/Kafka Source Business Logic Sink < Target Time >= Target Time
  15. 15. Continuous Window Semantics ● Continuous window counts go up and down ● SELECT user_id, count(ride_id) OVER (PARTITION BY user_id ORDER BY rowtime RANGE INTERVAL '1' HOUR PRECEDING) from event_ride_completed ● Example: rides 1p, 1:30p, 2:15p, 3:45p ● Flink default is one message in, one message out ○ Output = 1@1p, 2@1:30p, 2@2:15p, 1@3:45p ● Dryft is one message in, two messages out ○ Output = 1@1p, 2@1:30p, 1@2p, 2@2:15p, 1@2:30p, 0@3:15p, 1@3:45p, 0@4:45p ● Other fancy SQL analysis tricks too
  16. 16. Beyond Feature Generation ● Reactive Programming ○ When an event occurs, execute some logic ● Asynchronous Programming ○ Perform more processing asynchronously ● Geotriggering ○ Emit an event when someone enters or leaves an area ● Change Data Capture ○ Mirror production data in analytical data stores
  17. 17. The End of the Stream ● Real-time Visualization w/ Druid and Superset ● Presto for <10s over large data sets ● Hive and Spark for extreme data sets ● Flyte - extreme data set workflows
  18. 18. Q&A

×