Arup Malakar | amalakar@lyft.com
Streaming SQL and Druid
Druid Bay Area Meetup @ Lyft
Agenda
• Use cases/motivations
• Architecture
‒ Apache Flink
‒ Druid
• Conclusion
Use cases
Users
• Analysts
• Data Scientists
• Engineers
• Operations Team
• Executives
Access methods
• Dashboards
• Interactive Exploration
• Freeform SQL
Example questions
Realtime
• How is the new pickup location at SFO airport affecting the market?
Geospatial
• Are the promos we deployed earlier in a sub-region effective at moving the metrics we thought they would move?
Anomaly
• Alert GMs when sub-region conversion is low because of a lack of supply.
Architecture
Earlier
Limitations
• Only yesterday’s data is queryable in the analytical DB
• P75 query latency in Presto is 30 seconds
Requirements
• Data freshness < 1 minute
• P95 query latency < 5 seconds
• Geospatial support
Now
Apache Flink - Stream processor
• Scalable/performant distributed stream processor
• API heavily influenced by Google’s Dataflow Model
• Event time processing (see sketch below)
• APIs
‒ Functional APIs
‒ Stream SQL
‒ Direct APIs
• Joins
• Windowing
• Supports batch execution
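To make the event-time and windowing bullets concrete, here is a minimal sketch in Flink's DataStream API. The RideEvent type, field names, watermark bound, and one-minute window are illustrative assumptions, not Lyft's actual job.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class RideEventCounts {

    // Hypothetical event POJO; public fields and a no-arg constructor
    // let Flink treat it as a POJO type.
    public static class RideEvent {
        public String region;
        public long timestampMillis;
        public int rides;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env
            // Stand-in for a real source such as a Kafka consumer.
            .fromElements(new RideEvent())
            // Event-time processing: watermarks derived from the event's own
            // timestamp, tolerating 30 seconds of out-of-order arrival.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<RideEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                    .withTimestampAssigner((event, ts) -> event.timestampMillis))
            .keyBy(event -> event.region)
            // Windowing: one-minute tumbling windows in event time.
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .sum("rides")
            .print();

        env.execute("ride-event-counts");
    }
}
```

Bounded-out-of-orderness watermarks are what allow results to be emitted within a tight freshness budget while still tolerating late-arriving events.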
Druid - Columnar database
● Scalable in-memory columnar database
● Support for geospatial data
● Extensible
● Native integration with Superset
● Real-time ingestion
Flink Stream SQL
● Familiarity with SQL
● Powerful semantics for data manipulation (example below)
● Streaming and batch mode
● Extensibility via UDF
● Joins
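As an example of those semantics, a hedged sketch of a windowed aggregation expressed in Flink Stream SQL via the Table API. The ride_events table, its columns, and the rowtime attribute are assumptions for illustration.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class RideCountsSql {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Assumes a ride_events table has already been registered (e.g. over
        // a Kafka topic) with an event-time attribute named rowtime.
        Table counts = tEnv.sqlQuery(
            "SELECT region, "
                + "  TUMBLE_START(rowtime, INTERVAL '1' MINUTE) AS window_start, "
                + "  COUNT(*) AS rides "
                + "FROM ride_events "
                + "GROUP BY region, TUMBLE(rowtime, INTERVAL '1' MINUTE)");

        // `counts` can then be written to a sink that feeds Druid ingestion.
    }
}
```

Joins against another registered table work the same way inside the query string, which is what lets the pipeline enrich events before they reach Druid.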
UDFs
● Geohash
● Geo region extraction
● URL cardinality reduction/normalization (sketched below)
○ /users/d9cca721a735d/location -> /users/{hash}/location
○ /v1//api// -> /v1/api
● User agent parsing
○ OS name / version
○ App Name / version
● Sampling
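A hedged reconstruction of what the URL-normalization UDF could look like as a Flink ScalarFunction; the class name and regexes are assumptions that merely reproduce the two examples above.

```java
import org.apache.flink.table.functions.ScalarFunction;

// Hypothetical reconstruction of the URL-normalization UDF described above.
public class NormalizeUrl extends ScalarFunction {
    // Flink invokes the public eval method once per row.
    public String eval(String url) {
        if (url == null) {
            return null;
        }
        return url
            // Collapse repeated slashes: /v1//api// -> /v1/api/
            .replaceAll("/{2,}", "/")
            // Replace long hex identifiers with a placeholder:
            // /users/d9cca721a735d/location -> /users/{hash}/location
            .replaceAll("/[0-9a-f]{8,}", "/{hash}")
            // Drop a trailing slash: /v1/api/ -> /v1/api
            .replaceAll("/$", "");
    }
}
```

Once registered with the table environment's function-registration API, the UDF is callable from any Flink SQL query, which is how transformations like these stay in SQL rather than in bespoke jobs.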
Lyft Druid Spec
Flink SQL
Data Flow
Validation of ingestion-spec
• Ingestion spec under source control
• Protobuf schema-based compile-time validation (sketched below)
‒ SQL
‒ Data type
‒ Column names
• Integration tests on sample data
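A sketch of how protobuf-schema-based column validation might work, under the assumption that each event type has a generated protobuf Descriptor; the class and method names here are hypothetical. The idea is that a typo in the ingestion spec fails the build rather than the pipeline.

```java
import com.google.protobuf.Descriptors.Descriptor;

import java.util.List;

public class SpecValidator {
    // Fails fast if the ingestion spec references a column that does not
    // exist in the event's protobuf schema. `descriptor` would come from a
    // generated class, e.g. RideEvent.getDescriptor(); names are illustrative.
    public static void validateColumns(Descriptor descriptor, List<String> columns) {
        for (String column : columns) {
            if (descriptor.findFieldByName(column) == null) {
                throw new IllegalArgumentException(
                    "Ingestion spec references unknown column: " + column);
            }
        }
    }
}
```

Run from a unit test over every spec under source control, a check like this catches schema drift before deployment.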
Exploration
Next
Goal - all events in Druid in real time
• If you log it, you will find it
• Automagic Druid spec
‒ Offline analysis for dimensions/metrics
‒ Cardinality analysis (sketched below)
‒ Reasonable defaults
• Auto provisioning of various resources
‒ Kafka topic for a new event
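A hedged sketch of the cardinality-analysis step: count distinct values per field over an offline sample and keep only fields below a threshold as candidate dimensions. The threshold and all names are illustrative assumptions, not the actual spec generator.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SpecSuggester {
    // Illustrative threshold: fields with more distinct values than this are
    // likely too high-cardinality to index as Druid dimensions.
    private static final int MAX_DIMENSION_CARDINALITY = 10_000;

    // Given a sample of events (field name -> value), suggest dimension fields.
    public static Set<String> suggestDimensions(List<Map<String, String>> sample) {
        Map<String, Set<String>> distinct = new HashMap<>();
        for (Map<String, String> event : sample) {
            for (Map.Entry<String, String> field : event.entrySet()) {
                distinct.computeIfAbsent(field.getKey(), k -> new HashSet<>())
                        .add(field.getValue());
            }
        }
        Set<String> dimensions = new HashSet<>();
        for (Map.Entry<String, Set<String>> e : distinct.entrySet()) {
            if (e.getValue().size() <= MAX_DIMENSION_CARDINALITY) {
                dimensions.add(e.getKey());
            }
        }
        return dimensions;
    }
}
```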
Conclusion
Conclusion
• Flink streaming SQL augments Druid’s capabilities
‒ Transformation
‒ Joins
‒ Sampling
• An easy-to-use ingestion framework is crucial for adoption
Challenges/Next
• Good retention strategies
‒ Size-based or time-based?
• Query rate limiting
• Flink batch ingestion
• Anomaly detection
• Root cause analysis
Thank you!
