Streaming SQL and Druid

Arup Malakar | amalakar@lyft.com
Druid Bay Area Meetup @ Lyft

Agenda
• Use cases/motivations
• Architecture
‒ Apache Flink
‒ Druid
• Conclusion

Users
• Analysts
• Data Scientists
• Engineers
• Operations Team
• Executives
Access methods
• Dashboards
• Interactive Exploration
• Freeform SQL

Example questions
Realtime
• How is the new pickup location in SFO airport affecting the market?
Geospatial
• Are the promos we deployed earlier in a sub-region effective at moving the metrics we thought they would move?
Anomaly
• Alert GMs when sub-region conversion is low because of a lack of supply.

Limitations
• Only yesterday’s data is queryable in the analytical DB
• P75 query latency in Presto is 30 seconds
Requirements
• Data freshness < 1 minute
• P95 query latency < 5 seconds
• Geospatial support

Apache Flink - Stream processor
• Scalable/performant distributed stream processor
• API heavily influenced by Google’s Dataflow Model
• Event time processing
• APIs
‒ Functional APIs
‒ Stream SQL
‒ Direct APIs
• Joins
• Windowing
• Supports batch execution
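To make "event time processing" and "windowing" concrete, here is a minimal sketch in plain Python. It is not Flink's API: the field names (`ts_ms`, `region`) and the one-minute window size are assumptions for illustration, and it ignores watermarks and late-data handling, which a real stream processor has to deal with.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # assumed window size: 1-minute tumbling windows


def window_start(event_time_ms, size_ms=WINDOW_MS):
    """Start of the tumbling window that contains the event timestamp."""
    return event_time_ms - (event_time_ms % size_ms)


def count_by_window(events, size_ms=WINDOW_MS):
    """Count events per (window start, region) using event time, not arrival order."""
    counts = defaultdict(int)
    for event in events:
        key = (window_start(event["ts_ms"], size_ms), event["region"])
        counts[key] += 1
    return dict(counts)


# Hypothetical events that arrive out of order.
events = [
    {"ts_ms": 61_000, "region": "sfo"},
    {"ts_ms": 5_000, "region": "sfo"},    # arrives late, belongs to the first window
    {"ts_ms": 59_999, "region": "oak"},
    {"ts_ms": 119_999, "region": "sfo"},
]
result = count_by_window(events)
```

Because each event is assigned to a window by its own timestamp, the late `sfo` event still lands in the window starting at 0 rather than in the window that was open when it arrived.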

Druid - Columnar database
● Scalable in-memory columnar database
● Support for geospatial data
● Extensible
● Native integration with Superset
● Real-time ingestion

Flink Stream SQL
● Familiarity with SQL
● Powerful semantics for data manipulation
● Streaming and batch mode
● Extensibility via UDF
● Joins

UDFs
● Geohash
● Geo region extraction
● URL cardinality reduction/normalization
○ /users/d9cca721a735d/location -> /users/{hash}/location
○ /v1//api// -> /v1/api
● User agent parsing
○ OS name / version
○ App Name / version
● Sampling
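Two of the UDFs listed above can be sketched in plain Python. This is an illustration of the logic only, not the UDF code from the talk: the rule for what counts as an ID-like path segment (eight or more lowercase hex characters with at least one digit) is an assumption, and the function names are hypothetical. The geohash function follows the standard public geohash encoding.

```python
import re

# Assumed shape of an ID-like path segment: 8+ lowercase hex characters.
_HEX_ID = re.compile(r"^[0-9a-f]{8,}$")
# Standard geohash base-32 alphabet (no a, i, l, o).
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def normalize_url_path(path):
    """Collapse repeated slashes and replace ID-like segments with {hash}."""
    segments = [s for s in path.split("/") if s]
    cleaned = [
        "{hash}" if _HEX_ID.match(s) and any(c.isdigit() for c in s) else s
        for s in segments
    ]
    return "/" + "/".join(cleaned)


def geohash(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a base-32 geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < precision:
        # Bits alternate between longitude and latitude, longitude first.
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)
```

Both functions reduce cardinality before the data reaches Druid: the URL rewrite turns per-user paths into one dimension value, and a shorter geohash precision groups nearby coordinates into the same cell.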

Validation of ingestion-spec
• Ingestion spec under source control
• Protobuf schema based compile time validation
‒ SQL
‒ Data type
‒ Column names
• Integration tests on sample data
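The column-name and data-type checks can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dict stands in for the compiled Protobuf schema, the spec layout (`dimensions`, `metrics`, `field`) is hypothetical rather than the talk's actual format, and SQL validation is not covered.

```python
# Hypothetical stand-in for a compiled Protobuf schema: field name -> type.
EVENT_SCHEMA = {
    "ride_id": "string",
    "region": "string",
    "fare": "double",
    "ts": "int64",
}
NUMERIC_TYPES = {"int32", "int64", "float", "double"}


def validate_spec(spec, schema):
    """Return a list of problems; an empty list means the spec matches the schema."""
    errors = []
    for dim in spec.get("dimensions", []):
        if dim not in schema:
            errors.append(f"unknown dimension column: {dim}")
    for metric in spec.get("metrics", []):
        field = metric["field"]
        if field not in schema:
            errors.append(f"unknown metric column: {field}")
        elif schema[field] not in NUMERIC_TYPES:
            errors.append(f"metric column {field} is not numeric: {schema[field]}")
    return errors


good_spec = {
    "dimensions": ["region"],
    "metrics": [{"name": "fare_sum", "field": "fare"}],
}
bad_spec = {
    "dimensions": ["regoin"],  # typo in the column name
    "metrics": [{"name": "id_sum", "field": "ride_id"}],  # summing a string column
}
```

Running such a check in the build means a misspelled column or a wrongly typed metric fails before the spec is deployed, instead of surfacing as missing data in Druid.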

Goal - all events in Druid in real time
• If you log it, you will find it
• Automagic Druid spec
‒ Offline analysis for dimensions/metrics
‒ Cardinality analysis
‒ Reasonable defaults
• Auto provisioning of various resources
‒ Kafka topic for a new event
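One way the cardinality analysis with reasonable defaults could work is sketched below. This is an assumption-laden illustration, not the system described in the talk: the cardinality cutoff and the function name are hypothetical, numeric fields are simply summed, and a timestamp column would need separate handling. The `dimensionsSpec` and `metricsSpec` keys and the `count` and `doubleSum` aggregator types follow Druid's ingestion spec naming; the rest of a real spec is omitted.

```python
MAX_DIMENSION_CARDINALITY = 1000  # hypothetical cutoff for keeping a dimension


def suggest_spec(sample_events, max_cardinality=MAX_DIMENSION_CARDINALITY):
    """Suggest dimensions and metrics from sample events.

    Returns (spec_fragment, skipped_fields). Numeric fields become sum metrics,
    low-cardinality fields become dimensions, and the rest are skipped.
    """
    values = {}
    for event in sample_events:
        for field, value in event.items():
            values.setdefault(field, []).append(value)

    dimensions, skipped = [], []
    metrics = [{"type": "count", "name": "count"}]
    for field, seen in sorted(values.items()):
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in seen
        )
        if is_numeric:
            metrics.append(
                {"type": "doubleSum", "name": f"{field}_sum", "fieldName": field}
            )
        elif len(set(map(str, seen))) <= max_cardinality:
            dimensions.append(field)
        else:
            skipped.append(field)  # too many distinct values to be a useful dimension

    fragment = {"dimensionsSpec": {"dimensions": dimensions}, "metricsSpec": metrics}
    return fragment, skipped


# Hypothetical sample; the cutoff is lowered to 2 so the tiny sample shows all three outcomes.
sample = [
    {"region": "sfo", "ride_id": "a1", "fare": 12.5},
    {"region": "sfo", "ride_id": "b2", "fare": 8.0},
    {"region": "oak", "ride_id": "c3", "fare": 20.0},
]
fragment, skipped = suggest_spec(sample, max_cardinality=2)
```

In this sample `region` becomes a dimension, `fare` becomes a sum metric, and `ride_id` is skipped because every value is distinct.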

Conclusion
• Flink streaming SQL augments Druid's capabilities
‒ Transformation
‒ Joins
‒ Sampling
• An easy-to-use ingestion framework is crucial for adoption

Challenges/Next
• Good retention strategies
‒ Data size based/time based?
• Query rate limiting
• Flink batch ingestion
• Anomaly detection
• Root cause analysis