SlideShare a Scribd company logo
1 of 45
Download to read offline
Near real-time
anomaly
detection at Lyft
Mark Grover | @mark_grover
Thomas Weise | @thweise
go.lyft.com/streaming-at-lyft
Agenda
● Data at Lyft
● 3 problems in streaming
● Conclusion
Data at Lyft
Lyft: Fastest ride sharing company in the US
Data platform users
5
Data Modelers Analysts Data
Scientists
General
Managers
Data Platform
Engineers ExperimentersProduct
Managers
Analytics Biz ops Building apps Experimentation
6
Data Platform architecture
Custom apps
Services (e.g.
ETA, Pricing)
Operational Data
stores (e.g.
Dynamo)
Models +
Applications (e.g.
ETA, Pricing)
Flyte
Data platform users
7
Data Modelers Analysts Data
Scientists
General
Managers
Data Platform
Engineers ExperimentersProduct
Managers
Analytics Biz ops Building apps Experimentation
8
Data Platform architecture
Custom apps
Services (e.g.
ETA, Pricing)
Operational Data
stores (e.g.
Dynamo)
Models +
Applications (e.g.
ETA, Pricing)
Flyte
How can
streaming help
build better
applications?
1. Engineer
Responsibility
Build great products
Alerting Business metrics
Requirements
Anomaly detection on business metrics
Anomaly Detection use cases
Security Ops Payment fraud Customer service Accident detection
2. Data Scientist
Responsibility
Extract knowledge and insights from data
(To build better products)
Prototype in a language of
choice (Python, R, SQL)
Quick and simple ways
of “cleaning” data
Requirements
Prototype in a language of choice (Python, R, SQL)
Quick and simple ways of cleaning data
Data Science use cases - Driver app
Data Science use cases - Pricing
Historical architecture
State Store
Model 1 Model 2 Model 3
t=60s
t=60s
t=60s
t=63s
t=65s
t=66s
New architecture - Flink
State Store
Model 1 Model 2 Model 3
t=60s
t=63s
t=68s
t=63s
t=68s
t=74s
Today’s focus on 3 streaming use cases
1 Anomaly Detection
2
3
Making Data Prep Easy
Support non-JVM Languages
1. Anomaly
detection
What is the problem?
Security Ops
Payment fraud Customer
service
Accident detection
Business metrics
alerting
20
Anomaly detection architecture
Services (e.g.
ETA, Pricing)
Operational Data
stores (e.g.
Dynamo)
Impact
Business metric alerting Financial line items alerting
Challenges
● Barrier to entry is pretty high
○ Takes a long time to ingest and tune alerts
2. Making Data
Prep Easy
What is the problem?
● Data preparation - everyone needs it, examples:
○ Write raw data from stream to S3 for batch consumers
○ Filter, aggregate, … the usual ETL stuff
● Enable teams to focus on business problems, don’t worry
about “getting data in”
● Data ingress still is surprisingly difficult
○ Really?
○ Give our users a service that shields them from
infrastructure complexity
Dryft
fully managed data processing engine, powering real-time features and events
● Need - Consistent Feature Generation
○ The value of your machine learning results is only as good as the data
○ Subtle changes to how a feature value is generated can significantly impact results
● Solution - Unify feature generation
○ Batch processing for bulk creation of features for training ML models
○ Stream processing for real-time creation of features for scoring ML models
● How - Flink SQL
○ Use Flink as the processing engine using streaming or bulk data
○ Add automation to make it super simple to launch and maintain feature generation programs
at scale
https://www.slideshare.net/SeattleApacheFlinkMeetup/streaminglyft-greg-fee-seattle-apache-flink-meetup-104398613/#11
Dryft Program
Configuration file decl_ride_completed.sql
{
"source": "dryft",
"query_file": "decl_ride_completed.sql",
"kinesis": {
"stream": "declridecompleted" },
"features": {
"n_total_rides": {
"description": "All time ride count per user",
"type": "int",
"version": 1 }
}
}
SELECT COALESCE(user_lyft_id,
passenger_lyft_id, passenger_id, -1) AS user_id,
COUNT(ride_id) as n_total_rides
FROM event_ride_completed
GROUP BY COALESCE(user_lyft_id,
passenger_lyft_id, passenger_id, -1)
Dryft Program Execution
● Backfill - read historic data from S3, process, sink to S3
● Real-time - read stream data from Kinesis/Kafka, process, sink
to DynamoDB
SinkS3 Source SQL
SinkKinesis/Kafka Source SQL
Bootstrapping
● Read historic data from S3
● Transition to reading real-time data
● https://data-artisans.com/flink-forward/resources/bootstrappin
g-state-in-apache-flink
S3 Source
Kinesis/Kafka Source
Business
Logic
Sink
< Target Time
>= Target Time
When to Dryft
• Feature Generation as original driver
• Declarative Streaming ETL
‒ Stream to Table / Stream
• SQL - Simplicity <> Power tradeoff
‒ Flink SQL supports UDFs (written in Java)
‒ A UDF could also do a service call, but..
When we need Programming
https://ci.apache.org/projects/flink/flink-docs-stable/concepts/programming-model.html
Flink Streaming Options
• SQL - Dryft
• Java DataStream API - the usual starting point
‒ Sources, Sinks, Windowing, Implicit State Management
‒ Fluent style, high abstraction level
• ProcessFunction for advanced logic
‒ User code controlled state and timers
• Nice fit when Java is already established
‒ Forced language switch is hard sell, time to value long and less predictable
‒ Initial Flink Deployments at Lyft
‒ But we do a lot of stuff in Python..
3. Support
non-JVM
languages
What is the problem?
● Flink API primarily target Java developers
○ Most of our teams that want to solve streaming use
cases don’t work with Java
● Enable streaming native to the language ecosystem
○ Python is the primary option for ML
○ (Use cases not addressed by Dryft/Flink SQL)
Streaming Options for Python
• Jython != Python
‒ Flink Python API and few more
• Jep (Java Embedded Python)
• KCL workers, Kafka consumers as standalone services
• Spark PySpark
‒ Not so much streaming, different semantics
‒ Different deployment story
• Faust
‒ Kafka Streams inspired
‒ No out of the box deployment story
Apache Beam
1. End users: who want to write
pipelines in a language that’s familiar.
2. SDK writers: who want to make Beam
concepts available in new languages.
3. Runner writers: who have a
distributed processing environment
and want to support Beam pipelines
Beam Model: Fn Runners
Apache
Flink
Apache
Spark
Beam Model: Pipeline Construction
Other
LanguagesBeam Java
Beam
Python
Execution Execution
Cloud
Dataflow
Execution
https://s.apache.org/apache-beam-project-overview
Beam Python Example
def pipeline(root):
input = root | ReadFromText("/path/to/text*") | Map(lambda line: ...)
scores = (input
| WindowInto(FixedWindows(120)
trigger=AfterWatermark(
early=AfterProcessingTime(60),
late=AfterCount(1))
accumulation_mode=ACCUMULATING)
| CombinePerKey(sum))
scores | WriteToText("/path/to/outputs")
MyRunner().run(pipeline)
( What, Where, When, How )
Python on Flink via Beam
• Beam model and Flink go well along
‒ Flink Runner most advanced OSS option for Beam Java SDK
• Python SDK already available on Dataflow
• Beam Language Portability allows Python (and Go) SDK to work
with JVM-based runners
‒ Flink Runner is first to support portability
• Flink Deployment Story
‒ Extend to run Python via Beam on Flink
Python on Flink via Beam
Job Service
Artifact
Staging
Job Manager
Fn Services
SDK Harness /
UDFs
Provision Control Data
Artifact
Retrieval
State Logging
Cluster
Runner
Dependencies
(optional)
python -m apache_beam.examples.wordcount 
--input=/etc/profile 
--output=/tmp/py-wordcount-direct 
--experiments=beam_fn_api 
--runner=PortableRunner 
--sdk_location=container 
--job_endpoint=localhost:8099 
--streaming
https://s.apache.org/streaming-python-beam-flink
3.5 But, how do
we deploy all
this?
Deployment
40
Streaming
Application
(Dryft, Java,
Beam, ...)
Stream / Schema
Registry
Deployment
Tooling
Metrics &
Dashboards
Alerts Logging
Amazon
EC2
Amazon S3 Wavefront
Salt
(Config / Orca)
Docker
Source Sink
Future of Deployment
• Flink embraces containerization
‒ Reactive vs. Active Flink Container Mode
(resources supplied externally vs. actively requested)
• Kubernetes Operator
‒ Resource Elasticity
‒ Improved Resource Utilization
‒ Auto-Scaling Support
‒ Automate (stateful) upgrade
Learnings
• Integration
‒ Things work well in isolation, but..
‒ Flink Kinesis Consumer
‣ Connectors that work reliably at scale are easy hard
• Things we find at scale
‒ Intermittent AWS service errors (Kinesis, S3)
‣ Retry vs. topology reset
‒ S3 hotspotting with Flink checkpointing for large jobs (FLINK-9061)
‒ Naive pubsub consumption can lead to massive state buffering
‣ Align watermarks across source partitions
Conclusion
● Data at Lyft
● 3 problems in streaming
○ Anomaly Detection - Anodot
○ Easy data prep - Dryft
○ Non-JVM language support - Apache Beam
Conclusion
We are hiring!
lyft.com/careers
https://goo.gl/RsyLkS
go.lyft.com/streaming-at-lyft
Images from the Noun Project
Mark Grover | @mark_grover
Thomas Weise | @thweise

More Related Content

What's hot

CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®confluent
 
Introducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumIntroducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumChengKuan Gan
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDatabricks
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxData
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David AndersonVerverica
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflowDatabricks
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkDataWorks Summit
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultDataWorks Summit
 
Monitoring Flink with Prometheus
Monitoring Flink with PrometheusMonitoring Flink with Prometheus
Monitoring Flink with PrometheusMaximilian Bode
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergFlink Forward
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Databricks
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...Altinity Ltd
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfAlkin Tezuysal
 

What's hot (20)

CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Introducing Change Data Capture with Debezium
Introducing Change Data Capture with DebeziumIntroducing Change Data Capture with Debezium
Introducing Change Data Capture with Debezium
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
 
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
InfluxDB IOx Tech Talks: Query Engine Design and the Rust-Based DataFusion in...
 
Deploying Flink on Kubernetes - David Anderson
 Deploying Flink on Kubernetes - David Anderson Deploying Flink on Kubernetes - David Anderson
Deploying Flink on Kubernetes - David Anderson
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Introduction to MLflow
Introduction to MLflowIntroduction to MLflow
Introduction to MLflow
 
Bootstrapping state in Apache Flink
Bootstrapping state in Apache FlinkBootstrapping state in Apache Flink
Bootstrapping state in Apache Flink
 
KFServing and Feast
KFServing and FeastKFServing and Feast
KFServing and Feast
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at Renault
 
Monitoring Flink with Prometheus
Monitoring Flink with PrometheusMonitoring Flink with Prometheus
Monitoring Flink with Prometheus
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Batch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & IcebergBatch Processing at Scale with Flink & Iceberg
Batch Processing at Scale with Flink & Iceberg
 
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
 

Similar to Near real-time anomaly detection at Lyft

Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flinkconfluent
 
FluentD for end to end monitoring
FluentD for end to end monitoringFluentD for end to end monitoring
FluentD for end to end monitoringPhil Wilkins
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasMonal Daxini
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesKarthik Murugesan
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyYaroslav Tkachenko
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...GetInData
 
Webinar september 2013
Webinar september 2013Webinar september 2013
Webinar september 2013Marc Gille
 
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...Flink Forward
 
Flyte kubecon 2019 SanDiego
Flyte kubecon 2019 SanDiegoFlyte kubecon 2019 SanDiego
Flyte kubecon 2019 SanDiegoKetanUmare
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowKaxil Naik
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Value Association
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkKostas Tzoumas
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System OverviewFlink Forward
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analyticsXiang Fu
 
Taking Your FDM Application to the Next Level with Advanced Scripting
Taking Your FDM Application to the Next Level with Advanced ScriptingTaking Your FDM Application to the Next Level with Advanced Scripting
Taking Your FDM Application to the Next Level with Advanced ScriptingAlithya
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesMichael Stephenson
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Bowen Li
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureTimothy Spann
 

Similar to Near real-time anomaly detection at Lyft (20)

Workshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con FlinkWorkshop híbrido: Stream Processing con Flink
Workshop híbrido: Stream Processing con Flink
 
FluentD for end to end monitoring
FluentD for end to end monitoringFluentD for end to end monitoring
FluentD for end to end monitoring
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paas
 
Lyft data Platform - 2019 slides
Lyft data Platform - 2019 slidesLyft data Platform - 2019 slides
Lyft data Platform - 2019 slides
 
Apache Flink Adoption at Shopify
Apache Flink Adoption at ShopifyApache Flink Adoption at Shopify
Apache Flink Adoption at Shopify
 
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
Functioning incessantly of Data Science Platform with Kubeflow - Albert Lewan...
 
Webinar september 2013
Webinar september 2013Webinar september 2013
Webinar september 2013
 
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...
 
Flyte kubecon 2019 SanDiego
Flyte kubecon 2019 SanDiegoFlyte kubecon 2019 SanDiego
Flyte kubecon 2019 SanDiego
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
Streaming in the Wild with Apache Flink
Streaming in the Wild with Apache FlinkStreaming in the Wild with Apache Flink
Streaming in the Wild with Apache Flink
 
Apache Flink Training: System Overview
Apache Flink Training: System OverviewApache Flink Training: System Overview
Apache Flink Training: System Overview
 
Resume
ResumeResume
Resume
 
Scaling up uber's real time data analytics
Scaling up uber's real time data analyticsScaling up uber's real time data analytics
Scaling up uber's real time data analytics
 
Taking Your FDM Application to the Next Level with Advanced Scripting
Taking Your FDM Application to the Next Level with Advanced ScriptingTaking Your FDM Application to the Next Level with Advanced Scripting
Taking Your FDM Application to the Next Level with Advanced Scripting
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration Services
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
 
RAGHUNATH_GORLA_RESUME
RAGHUNATH_GORLA_RESUMERAGHUNATH_GORLA_RESUME
RAGHUNATH_GORLA_RESUME
 
Cloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azureCloud lunch and learn real-time streaming in azure
Cloud lunch and learn real-time streaming in azure
 

More from markgrover

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting datamarkgrover
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 markgrover
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationmarkgrover
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsenmarkgrover
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy designmarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security datamarkgrover
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadatamarkgrover
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beammarkgrover
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speedmarkgrover
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyftmarkgrover
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spotmarkgrover
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoopmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsmarkgrover
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoopmarkgrover
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorialmarkgrover
 

More from markgrover (20)

From discovering to trusting data
From discovering to trusting dataFrom discovering to trusting data
From discovering to trusting data
 
Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020 Amundsen lineage designs - community meeting, Dec 2020
Amundsen lineage designs - community meeting, Dec 2020
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
Amundsen gremlin proxy design
Amundsen gremlin proxy designAmundsen gremlin proxy design
Amundsen gremlin proxy design
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Amundsen: From discovering to security data
Amundsen: From discovering to security dataAmundsen: From discovering to security data
Amundsen: From discovering to security data
 
Data Discovery & Trust through Metadata
Data Discovery & Trust through MetadataData Discovery & Trust through Metadata
Data Discovery & Trust through Metadata
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
TensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache BeamTensorFlow Extension (TFX) and Apache Beam
TensorFlow Extension (TFX) and Apache Beam
 
Big Data at Speed
Big Data at SpeedBig Data at Speed
Big Data at Speed
 
Dogfooding data at Lyft
Dogfooding data at LyftDogfooding data at Lyft
Dogfooding data at Lyft
 
Fighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache SpotFighting cybersecurity threats with Apache Spot
Fighting cybersecurity threats with Apache Spot
 
Fraud Detection with Hadoop
Fraud Detection with HadoopFraud Detection with Hadoop
Fraud Detection with Hadoop
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Intro to hadoop tutorial
Intro to hadoop tutorialIntro to hadoop tutorial
Intro to hadoop tutorial
 

Recently uploaded

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Recently uploaded (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Near real-time anomaly detection at Lyft

  • 1. Near real-time anomaly detection at Lyft Mark Grover | @mark_grover Thomas Weise | @thweise go.lyft.com/streaming-at-lyft
  • 2. Agenda ● Data at Lyft ● 3 problems in streaming ● Conclusion
  • 4. Lyft: Fastest ride sharing company in the US
  • 5. Data platform users 5 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers Analytics Biz ops Building apps Experimentation
  • 6. 6 Data Platform architecture Custom apps Services (e.g. ETA, Pricing) Operational Data stores (e.g. Dynamo) Models + Applications (e.g. ETA, Pricing) Flyte
  • 7. Data platform users 7 Data Modelers Analysts Data Scientists General Managers Data Platform Engineers ExperimentersProduct Managers Analytics Biz ops Building apps Experimentation
  • 8. 8 Data Platform architecture Custom apps Services (e.g. ETA, Pricing) Operational Data stores (e.g. Dynamo) Models + Applications (e.g. ETA, Pricing) Flyte
  • 9. How can streaming help build better applications?
  • 10. 1. Engineer Responsibility Build great products Alerting Business metrics Requirements Anomaly detection on business metrics
  • 11. Anomaly Detection use cases Security Ops Payment fraud Customer service Accident detection
  • 12. 2. Data Scientist Responsibility Extract knowledge and insights from data (To build better products) Prototype in a language of choice (Python, R, SQL) Quick and simple ways of “cleaning” data Requirements Prototype in a language of choice (Python, R, SQL) Quick and simple ways of cleaning data
  • 13. Data Science use cases - Driver app
  • 14. Data Science use cases - Pricing
  • 15. Historical architecture State Store Model 1 Model 2 Model 3 t=60s t=60s t=60s t=63s t=65s t=66s
  • 16. New architecture - Flink State Store Model 1 Model 2 Model 3 t=60s t=63s t=68s t=63s t=68s t=74s
  • 17. Today’s focus on 3 streaming use cases 1 Anomaly Detection 2 3 Making Data Prep Easy Support non-JVM Languages
  • 19. What is the problem? Security Ops Payment fraud Customer service Accident detection Business metrics alerting
  • 20. 20 Anomaly detection architecture Services (e.g. ETA, Pricing) Operational Data stores (e.g. Dynamo)
  • 21. Impact Business metric alerting Financial line items alerting
  • 22. Challenges ● Barrier to entry is pretty high ○ Takes a long time to ingest and tune alerts
  • 24. What is the problem? ● Data preparation - everyone needs it, examples: ○ Write raw data from stream to S3 for batch consumers ○ Filter, aggregate, … the usual ETL stuff ● Enable teams to focus on business problems, don’t worry about “getting data in” ● Data ingress still is surprisingly difficult ○ Really? ○ Give our users a service that shields them from infrastructure complexity
  • 25. Dryft fully managed data processing engine, powering real-time features and events ● Need - Consistent Feature Generation ○ The value of your machine learning results is only as good as the data ○ Subtle changes to how a feature value is generated can significantly impact results ● Solution - Unify feature generation ○ Batch processing for bulk creation of features for training ML models ○ Stream processing for real-time creation of features for scoring ML models ● How - Flink SQL ○ Use Flink as the processing engine using streaming or bulk data ○ Add automation to make it super simple to launch and maintain feature generation programs at scale https://www.slideshare.net/SeattleApacheFlinkMeetup/streaminglyft-greg-fee-seattle-apache-flink-meetup-104398613/#11
  • 26. Dryft Program Configuration file decl_ride_completed.sql { "source": "dryft", "query_file": "decl_ride_completed.sql", "kinesis": { "stream": "declridecompleted" }, "features": { "n_total_rides": { "description": "All time ride count per user", "type": "int", "version": 1 } } } SELECT COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1) AS user_id, COUNT(ride_id) as n_total_rides FROM event_ride_completed GROUP BY COALESCE(user_lyft_id, passenger_lyft_id, passenger_id, -1)
  • 27. Dryft Program Execution ● Backfill - read historic data from S3, process, sink to S3 ● Real-time - read stream data from Kinesis/Kafka, process, sink to DynamoDB SinkS3 Source SQL SinkKinesis/Kafka Source SQL
  • 28. Bootstrapping ● Read historic data from S3 ● Transition to reading real-time data ● https://data-artisans.com/flink-forward/resources/bootstrappin g-state-in-apache-flink S3 Source Kinesis/Kafka Source Business Logic Sink < Target Time >= Target Time
  • 29. When to Dryft • Feature Generation as original driver • Declarative Streaming ETL ‒ Stream to Table / Stream • SQL - Simplicity <> Power tradeoff ‒ Flink SQL supports UDFs (written in Java) ‒ A UDF could also do a service call, but..
  • 30. When we need Programming https://ci.apache.org/projects/flink/flink-docs-stable/concepts/programming-model.html
  • 31. Flink Streaming Options • SQL - Dryft • Java DataStream API - the usual starting point ‒ Sources, Sinks, Windowing, Implicit State Management ‒ Fluent style, high abstraction level • ProcessFunction for advanced logic ‒ User code controlled state and timers • Nice fit when Java is already established ‒ Forced language switch is hard sell, time to value long and less predictable ‒ Initial Flink Deployments at Lyft ‒ But we do a lot of stuff in Python..
  • 33. What is the problem? ● Flink API primarily target Java developers ○ Most of our teams that want to solve streaming use cases don’t work with Java ● Enable streaming native to the language ecosystem ○ Python is the primary option for ML ○ (Use cases not addressed by Dryft/Flink SQL)
  • 34. Streaming Options for Python • Jython != Python ‒ Flink Python API and few more • Jep (Java Embedded Python) • KCL workers, Kafka consumers as standalone services • Spark PySpark ‒ Not so much streaming, different semantics ‒ Different deployment story • Faust ‒ Kafka Streams inspired ‒ No out of the box deployment story
  • 35. Apache Beam 1. End users: who want to write pipelines in a language that’s familiar. 2. SDK writers: who want to make Beam concepts available in new languages. 3. Runner writers: who have a distributed processing environment and want to support Beam pipelines Beam Model: Fn Runners Apache Flink Apache Spark Beam Model: Pipeline Construction Other LanguagesBeam Java Beam Python Execution Execution Cloud Dataflow Execution https://s.apache.org/apache-beam-project-overview
  • 36. Beam Python Example def pipeline(root): input = root | ReadFromText("/path/to/text*") | Map(lambda line: ...) scores = (input | WindowInto(FixedWindows(120) trigger=AfterWatermark( early=AfterProcessingTime(60), late=AfterCount(1)) accumulation_mode=ACCUMULATING) | CombinePerKey(sum)) scores | WriteToText("/path/to/outputs") MyRunner().run(pipeline) ( What, Where, When, How )
  • 37. Python on Flink via Beam • Beam model and Flink go well along ‒ Flink Runner most advanced OSS option for Beam Java SDK • Python SDK already available on Dataflow • Beam Language Portability allows Python (and Go) SDK to work with JVM-based runners ‒ Flink Runner is first to support portability • Flink Deployment Story ‒ Extend to run Python via Beam on Flink
  • 38. Python on Flink via Beam Job Service Artifact Staging Job Manager Fn Services SDK Harness / UDFs Provision Control Data Artifact Retrieval State Logging Cluster Runner Dependencies (optional) python -m apache_beam.examples.wordcount --input=/etc/profile --output=/tmp/py-wordcount-direct --experiments=beam_fn_api --runner=PortableRunner --sdk_location=container --job_endpoint=localhost:8099 --streaming https://s.apache.org/streaming-python-beam-flink
  • 39. 3.5 But, how do we deploy all this?
  • 40. Deployment 40 Streaming Application (Dryft, Java, Beam, ...) Stream / Schema Registry Deployment Tooling Metrics & Dashboards Alerts Logging Amazon EC2 Amazon S3 Wavefront Salt (Config / Orca) Docker Source Sink
  • 41. Future of Deployment • Flink embraces containerization ‒ Reactive vs. Active Flink Container Mode (resources supplied externally vs. actively requested) • Kubernetes Operator ‒ Resource Elasticity ‒ Improved Resource Utilization ‒ Auto-Scaling Support ‒ Automate (stateful) upgrade
  • 42. Learnings • Integration ‒ Things work well in isolation, but.. ‒ Flink Kinesis Consumer ‣ Connectors that work reliably at scale are easy hard • Things we find at scale ‒ Intermittent AWS service errors (Kinesis, S3) ‣ Retry vs. topology reset ‒ S3 hotspotting with Flink checkpointing for large jobs (FLINK-9061) ‒ Naive pubsub consumption can lead to massive state buffering ‣ Align watermarks across source partitions
  • 44. ● Data at Lyft ● 3 problems in streaming ○ Anomaly Detection - Anodot ○ Easy data prep - Dryft ○ Non-JVM language support - Apache Beam Conclusion
  • 45. We are hiring! lyft.com/careers https://goo.gl/RsyLkS go.lyft.com/streaming-at-lyft Images from the Noun Project Mark Grover | @mark_grover Thomas Weise | @thweise