
Streaming SQL and Druid


Druid provides sub-second query latency, and Flink provides SQL on streams, allowing rich transformation and enrichment of events as they happen. In this talk we will learn how Lyft uses Flink SQL and Druid together to support real-time analytics.

Meetup: https://www.meetup.com/druidio/events/252515792/


  1. Arup Malakar | amalakar@lyft.com
     Streaming SQL and Druid
     Druid Bay Area Meetup @ Lyft
  2. Agenda
     • Use cases/motivations
     • Architecture
       ‒ Apache Flink
       ‒ Druid
     • Conclusion
  3. Use cases
  4. Users
     • Analysts
     • Data Scientists
     • Engineers
     • Operations Team
     • Executives
     Access methods
     • Dashboards
     • Interactive Exploration
     • Freeform SQL
  5. Example questions
     Realtime
     • How is the new pickup location in SFO airport affecting the market?
     Geospatial
     • Are the promos we deployed earlier in a sub-region effective at moving the metrics we thought they would move?
     Anomaly
     • Alert GMs when sub-region conversion is low because of a lack of supply.
  6. Architecture
  7. Earlier
  8. Limitations
     • Only yesterday’s data is queryable in the analytical DB
     • P75 query latency in Presto is 30 seconds
     Requirements
     • Data freshness < 1 minute
     • P95 query latency < 5 seconds
     • Geospatial support
  9. Now
  10. Apache Flink - Stream processor
     • Scalable/performant distributed stream processor
     • API heavily influenced by Google’s Dataflow Model
     • Event time processing
     • APIs
       ‒ Functional APIs
       ‒ Stream SQL
       ‒ Direct APIs
     • Joins
     • Windowing
     • Supports batch execution
  11. Druid - Columnar database
     ● Scalable in-memory columnar database
     ● Support for geospatial data
     ● Extensible
     ● Native integration with Superset
     ● Real-time ingestion
  12. Flink Stream SQL
     ● Familiarity with SQL
     ● Powerful semantics for data manipulation
     ● Streaming and batch mode
     ● Extensibility via UDF
     ● Joins
  13. UDFs
     ● Geohash
     ● Geo region extraction
     ● URL cardinality reduction/normalization
       ○ /users/d9cca721a735d/location -> /users/{hash}/location
       ○ /v1//api// -> /v1/api
     ● User agent parsing
       ○ OS name / version
       ○ App name / version
     ● Sampling
  14. Lyft Druid Spec
  15. Flink SQL
  16. Data Flow
  17. Validation of ingestion spec
     • Ingestion spec under source control
     • Protobuf schema based compile-time validation
       ‒ SQL
       ‒ Data type
       ‒ Column names
     • Integration tests on sample data
  18. Exploration
  19. Next
  20. Goal - all events in Druid in real time
     • If you log it, you will find it
     • Automagic Druid spec
       ‒ Offline analysis for dimensions/metrics
       ‒ Cardinality analysis
       ‒ Reasonable defaults
     • Auto provisioning of various resources
       ‒ Kafka topic for a new event
  21. Conclusion
  22. Conclusion
     • Flink streaming SQL augments Druid capabilities
       ‒ Transformation
       ‒ Joins
       ‒ Sampling
     • Easy-to-use ingestion framework is crucial for adoption
  23. Challenges/Next
     • Good retention strategies
       ‒ Data size based/time based?
     • Query rate limiting
     • Flink batch ingestion
     • Anomaly detection
     • Root cause analysis
  24. Thank you!
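
The UDF slide gives two input/output pairs for URL cardinality reduction but no implementation. The following is a minimal Python sketch of that idea, not the UDF used at Lyft: the function name `normalize_url_path` and the rule "a segment of 8 or more hex characters is an identifier" are assumptions made for illustration. In production this logic would more likely live in a Flink scalar UDF.

```python
import re

# Assumption: an identifier-like path segment is 8 or more hex characters.
_HASH_SEGMENT = re.compile(r"^[0-9a-fA-F]{8,}$")


def normalize_url_path(path: str) -> str:
    """Collapse a raw URL path into a low-cardinality template."""
    # Splitting on "/" and dropping empty pieces removes repeated and
    # trailing slashes in one step.
    segments = [s for s in path.split("/") if s]
    # Replace identifier-like segments with a fixed placeholder.
    templated = ["{hash}" if _HASH_SEGMENT.match(s) else s for s in segments]
    return "/" + "/".join(templated)


print(normalize_url_path("/users/d9cca721a735d/location"))
print(normalize_url_path("/v1//api//"))
```

Both examples from the slide map to their listed outputs under this rule. Keeping the URL dimension to a small set of templates is what makes it practical to group and filter on.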

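The validation slide says column names and data types in the ingestion spec are checked against a Protobuf schema, without showing how. A rough sketch of such a check follows, under loud assumptions: the schema is a plain dict standing in for a Protobuf descriptor, the spec is a simplified fragment that reuses Druid's `dimensionsSpec`, `metricsSpec`, and `fieldName` keys, and `validate_spec` and `NUMERIC_TYPES` are hypothetical names.

```python
# Assumption: type names as they might appear in a schema dump.
NUMERIC_TYPES = {"int32", "int64", "float", "double"}


def validate_spec(schema: dict, spec: dict) -> list:
    """Return a list of human-readable problems; empty means the spec passes."""
    errors = []
    # Every dimension must name a column that exists in the schema.
    for dim in spec["dimensionsSpec"]["dimensions"]:
        if dim not in schema:
            errors.append(f"unknown dimension column: {dim}")
    # Every metric that reads a column must read an existing numeric one.
    for metric in spec["metricsSpec"]:
        field = metric.get("fieldName")
        if field is None:
            continue  # e.g. a count aggregator has no input column
        if field not in schema:
            errors.append(f"unknown metric column: {field}")
        elif schema[field] not in NUMERIC_TYPES:
            errors.append(f"non-numeric metric column: {field}")
    return errors


# Hypothetical event schema and spec fragment, for illustration only.
example_schema = {"region": "string", "fare": "double", "user_agent": "string"}
example_spec = {
    "dimensionsSpec": {"dimensions": ["region", "city"]},
    "metricsSpec": [
        {"type": "count", "name": "events"},
        {"type": "doubleSum", "name": "fare_sum", "fieldName": "fare"},
        {"type": "doubleSum", "name": "ua_sum", "fieldName": "user_agent"},
    ],
}

for problem in validate_spec(example_schema, example_spec):
    print(problem)
```

Running a check like this in CI, alongside the spec kept under source control, would reject a bad spec before it reaches the ingestion pipeline.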