Big Data at Speed
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
go.lyft.com/big-data-at-speed
The problem
[Diagram labels: Users, User interface, Analytical interface, Decision maker, Insight Lag]
Newer version of the problem
[Diagram labels: Users, User interface, Analytical interface, Decision maker, Insight Lag]
The average insight lag is on the order of days.
How can we reduce the insight lag?
[Diagram labels: Users, User interface, Analytical interface, Decision maker, Insight Lag]
What contributes to insight lag?
● Slow ingest and ETL
○ Derived data takes a while to become available.
● Slow human insights
○ Storage systems are not effective.
○ Tools for analyzing/gaining insights are not productive.
● Slow automated decisions
○ Developing and training models is hard.
Inside the “insight box” - historically
[Diagram labels: Source System A, Source System B, Source System C, ETL Engine, Data Warehouse]
Agenda - Faster insights
Faster ingest
Faster (human) insights
Faster (automated) decisions
How Lyft is pushing the envelope
● Detecting driver scarcity (or abundance) and incentivizing drivers to be where the passengers are
○ Marketplace imbalance is not good
● Marketplace parameters consist of:
○ Drivers
○ Passengers
○ Geography
○ Time!
● Decide using data, if/when/which incentive to deploy
● Deploy the right incentive automatically
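
A minimal sketch of the "decide using data" step above, assuming location pings tagged with a role, a geography bucket, and an event time. The `Ping` type, the five-minute bucket, the 2.0 threshold, and the `driver_bonus` label are all hypothetical and are not taken from Lyft's systems.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Ping(NamedTuple):
    role: str      # "driver" or "passenger"
    geo_cell: str  # hypothetical geography bucket, e.g. a geohash prefix
    ts: int        # event time in epoch seconds


def imbalance(pings: Iterable[Ping], bucket_secs: int = 300) -> Dict[Tuple[str, int], float]:
    """Return passengers per driver for each (geo_cell, time bucket)."""
    drivers: Counter = Counter()
    passengers: Counter = Counter()
    for p in pings:
        key = (p.geo_cell, p.ts // bucket_secs)
        if p.role == "driver":
            drivers[key] += 1
        else:
            passengers[key] += 1
    keys = set(drivers) | set(passengers)
    # Treat "no drivers" as one driver to avoid dividing by zero.
    return {k: passengers[k] / max(drivers[k], 1) for k in keys}


def choose_incentive(ratio: float, scarce_above: float = 2.0) -> Optional[str]:
    """Pick an incentive only when drivers are scarce (assumed threshold)."""
    return "driver_bonus" if ratio > scarce_above else None
```

Each of the four marketplace parameters shows up here: drivers and passengers are the two counters, and geography and time form the grouping key.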
1. Faster ingest
Agreement and Standards
● Define a scoped list of input options
● Make it easy
● Make it reliable
● Make it scalable
Standard Messages Vs Custom Messages
● The eternal debate
○ One side is too restrictive
○ The other is disorder
Make it Flexible
● Strongly Typed Schemas
● Schema on Demand
● Auditable for GDPR
● Define Routing on Demand
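
One way to get a standard message while keeping it flexible is a small typed envelope around a free-form payload. The sketch below assumes that approach; the field names (`event_name`, `event_ts`, `contains_pii`) are hypothetical, and the PII flag is only one possible hook for audit tooling.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Standard, strongly typed part of every message (assumed field names).
REQUIRED = {"event_name": str, "event_ts": int, "contains_pii": bool}


@dataclass(frozen=True)
class Envelope:
    event_name: str
    event_ts: int        # epoch seconds
    contains_pii: bool   # lets audit tooling find personal data
    payload: Dict[str, Any] = field(default_factory=dict)  # custom part


def validate(raw: Dict[str, Any]) -> Envelope:
    """Reject messages that break the standard part of the schema."""
    for name, expected in REQUIRED.items():
        if name not in raw:
            raise ValueError(f"missing field: {name}")
        if not isinstance(raw[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    payload = raw.get("payload", {})
    if not isinstance(payload, dict):
        raise TypeError("payload must be a dict")
    return Envelope(raw["event_name"], raw["event_ts"], raw["contains_pii"], payload)
```

The envelope is the restrictive side of the debate and the payload is the flexible side, so producers can add custom fields without breaking shared consumers.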
Pipes and Routing
● Routing through configurations
● Pipes on demand
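
Routing through configuration could look like the sketch below, where rules are plain data and adding a pipe means adding a rule rather than deploying code. The rule format, the pipe names, and the `long_term` default are assumptions made for illustration.

```python
from typing import Any, Dict, List

# Hypothetical routing config; in practice this could live in a config store.
ROUTES: List[Dict[str, Any]] = [
    {"match": {"event_name": "ride_requested"}, "pipes": ["analytics", "timeseries"]},
    {"match": {"contains_pii": True}, "pipes": ["audit"]},
]


def route(event: Dict[str, Any],
          routes: List[Dict[str, Any]] = ROUTES,
          default: str = "long_term") -> List[str]:
    """Return the pipes an event should be sent to, in rule order."""
    pipes: List[str] = []
    for rule in routes:
        # A rule matches when every key in "match" equals the event's value.
        if all(event.get(k) == v for k, v in rule["match"].items()):
            for pipe in rule["pipes"]:
                if pipe not in pipes:
                    pipes.append(pipe)
    return pipes or [default]
```

For example, `route({"event_name": "ride_requested", "contains_pii": True})` matches both rules, while an event that matches nothing falls through to the default pipe.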
Data Engineer vs. Data Scientist
Data Engineer Data Scientist
2. Faster human insights
Inside the “insight box” - historically
[Diagram labels: Source System A, Source System B, Source System C, ETL Engine, Data Warehouse]
Inside the “insight box” - Now
[Diagram labels: Source System A, Source System B, Source System C, Pipes, Analytical Storage, Long Term Storage, Searchable Storage, Time Series Storage, In Memory Windowing State, Auditing & Governance]
Inside the “insight box” - Now
[Diagram labels: Source System A, Source System B, Source System C, Pipes, Analytical Storage, Long Term Storage, Searchable Storage, Time Series Storage, In Memory Windowing State, Auditing & Governance]
Annotations on the diagram:
● Archival and storage
● Managed storage, SQL queries
● For a user X
● Grafana, Wavefront style dashboards
● Sessionization, windowing, etc.
Multiple storage systems
[Diagram labels: Events, Message buffer, Analytical Storage, Long Term Storage, Searchable Storage, Time Series Storage, In Memory Windowing State]
Importance of Auditing & Governance
● Protect against the disorder
● Isolate Kafka topics for different use cases
● Dynamic topic creation and routing is key
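
The governance points above can be sketched as a small in-memory registry that names one topic per use case and records who created it. This does not talk to Kafka; the `<use_case>.<event_name>` naming convention and the `TopicRegistry` class are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed naming convention: <use_case>.<event_name>, lowercase only.
TOPIC_RE = re.compile(r"^[a-z0-9_]+\.[a-z0-9_]+$")


class TopicRegistry:
    """Tracks which topics exist, who owns them, and how they were created."""

    def __init__(self) -> None:
        self.owners: Dict[str, str] = {}
        self.audit_log: List[Tuple[str, str, str]] = []  # (action, topic, owner)

    def ensure_topic(self, use_case: str, event_name: str, owner: str) -> str:
        topic = f"{use_case}.{event_name}"
        if not TOPIC_RE.match(topic):
            raise ValueError(f"topic name breaks convention: {topic}")
        if topic not in self.owners:
            # A real system would create the topic on the broker here.
            self.owners[topic] = owner
            self.audit_log.append(("create", topic, owner))
        return topic
```

Creating topics on demand stays safe because every creation passes through one place that enforces the naming rule and leaves an audit trail.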
Data Discovery - Search
Data Discovery - Understand
3. Faster automated decisions
How we take action: Learn and Act
[Diagram labels: Source System A, Source System B, Source System C, Pipes, Analytical Storage, Analysis, Programmer, Actionable Systems]
How we take action: Batch Generated Actions
[Diagram labels: Source System A, Source System B, Source System C, Pipes, Analytical Storage, Batch Job, Programmer, Automation, Actionable Systems]
How we take action: Stream Generated Actions
[Diagram labels: Source System A, Source System B, Source System C, Pipes, Stream Processing, Pipes, Storage, Model, Reviewers, Actionable Systems]
Inside the “insight box” - Now
[Diagram labels: Source System A, Source System B, Source System C, Pipes, Analytical Storage, Long Term Storage, Searchable Storage, Time Series Storage, Stream Processing, Auditing & Governance, Actionable Systems]
Faster Decisions
● Need to have a mindset of streaming data
○ Streams are tables
■ Tumbling
■ Sliding
■ Sessionization
■ Custom
● Train in Streams
● Output is Streams
● All the things are Streams
Windowing Lead and Lag
Tumbling Windowing
Sliding Windowing
Session Windowing
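
The three window types named on these slides can be sketched with plain functions over integer event times. This is a generic illustration of the standard definitions, not the API of any particular stream processor; the function names are made up for this example.

```python
from typing import List, Tuple


def tumbling(ts: int, size: int) -> Tuple[int, int]:
    """The single fixed, non-overlapping [start, end) window containing ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding(ts: int, size: int, slide: int) -> List[Tuple[int, int]]:
    """All [start, end) windows of length size, starting at multiples of
    slide, that contain ts. Windows overlap when slide < size."""
    last_start = ts - ts % slide
    # Walk back while the window still covers ts (start > ts - size).
    starts = range(last_start, ts - size, -slide)
    return [(s, s + size) for s in reversed(starts)]


def sessions(timestamps: List[int], gap: int) -> List[List[int]]:
    """Group event times into sessions; a pause longer than gap starts a new one."""
    out: List[List[int]] = []
    for ts in sorted(timestamps):
        if out and ts - out[-1][-1] <= gap:
            out[-1].append(ts)
        else:
            out.append([ts])
    return out
```

An event belongs to exactly one tumbling window and to several sliding windows, while session windows have no fixed size and are defined by gaps in activity.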
Windowing as a Table
Streams are Tables
● Feature creation based on windows
● Batch as Streaming
○ Partition by Entity
○ Sort By Time
○ Flatmap for every window trigger
● Batch Model can be fed by Streaming Windows
● Output is a Stream as well
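
The "batch as streaming" recipe above (partition by entity, sort by time, flatmap for every window trigger) can be sketched in plain Python. It assumes records of the form (entity, event time, value) and a trailing window that fires on every record; the `window_features` name and the count/total features are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int, float]        # (entity, event_ts, value)
Feature = Tuple[str, int, int, float]  # (entity, event_ts, count, total)


def window_features(records: Iterable[Record], size: int) -> Iterator[Feature]:
    """Emit one feature row per record over the trailing (ts - size, ts] window."""
    # 1. Partition by entity.
    by_entity: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for entity, ts, value in records:
        by_entity[entity].append((ts, value))
    for entity in sorted(by_entity):
        window: Deque[Tuple[int, float]] = deque()
        total = 0.0
        # 2. Sort by time within the partition.
        for ts, value in sorted(by_entity[entity]):
            window.append((ts, value))
            total += value
            # Evict records that have fallen out of the trailing window.
            while window[0][0] <= ts - size:
                _, old = window.popleft()
                total -= old
            # 3. Flatmap: every record triggers the window and yields a row.
            yield (entity, ts, len(window), total)
```

The output rows form a table of per-entity window features, and because they are yielded in time order within each entity, the same logic can feed a batch-trained model or run over a live stream.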
Journey From Input to Value
[Diagram labels: Source System A, Source System B, Source System C, Pipes, Analytical Storage, Long Term Storage, Searchable Storage, Time Series Storage, Stream Processing, Auditing & Governance, Actionable Systems]
Conclusion
Summary
● Insight lag
● How can we shorten insight lag
○ Faster ingest
○ Faster human insights
○ Faster automated decisions
Thank you!
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
Icons under Creative Commons License from https://thenounproject.com/
