
Streaming SQL and Druid

Druid provides sub-second query latency, and Flink provides SQL on streams, allowing rich transformation and enrichment of events as they happen. In this talk we will learn how Lyft uses Flink SQL and Druid together to support real-time analytics.

Meetup: https://www.meetup.com/druidio/events/252515792/



  1. Streaming SQL and Druid
     Druid Bay Area Meetup @ Lyft
     Arup Malakar | amalakar@lyft.com
  2. Agenda
     • Use cases/motivations
     • Architecture
       ‒ Apache Flink
       ‒ Druid
     • Conclusion
  3. Use cases
  4. Users
     • Analysts
     • Data Scientists
     • Engineers
     • Operations Team
     • Executives
     Access methods
     • Dashboards
     • Interactive Exploration
     • Freeform SQL
  5. Example questions
     • Realtime: How is the new pickup location at SFO airport affecting the market?
     • Geospatial: Are the promos we deployed earlier in a sub-region effective at moving the metrics we thought they would move?
     • Anomaly: Alert GMs when sub-region conversion is low because of a lack of supply.
  6. Architecture
  7. Earlier
  8. Limitations
     • Only yesterday’s data is queryable in the analytical DB
     • P75 query latency in Presto is 30 seconds
     Requirements
     • Data freshness < 1 minute
     • P95 query latency < 5 seconds
     • Geospatial support
  9. Now
  10. Apache Flink - Stream processor
      • Scalable/performant distributed stream processor
      • API heavily influenced by Google’s Dataflow Model
      • Event-time processing
      • APIs
        ‒ Functional APIs
        ‒ Stream SQL
        ‒ Direct APIs
      • Joins
      • Windowing
      • Supports batch execution
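Event-time processing and windowing are the features that matter most for the freshness requirements above. A minimal sketch in Flink's Java DataStream API; the ride-event tuples, the 5-second out-of-orderness bound, and the one-minute window are illustrative choices, not the talk's actual pipeline:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    import java.time.Duration;

    public class RegionCounts {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical (region, eventTimeMillis) ride events.
            env.fromElements(Tuple2.of("sfo", 1_000L), Tuple2.of("sfo", 61_000L))
                // Event-time processing: use the embedded timestamp and
                // tolerate events arriving up to 5 seconds out of order.
                .assignTimestampsAndWatermarks(
                    WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, ts) -> event.f1))
                // From here on, carry a count of 1 per event.
                .map(event -> Tuple2.of(event.f0, 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(event -> event.f0)
                // Count rides per region in one-minute event-time windows.
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .sum(1)
                .print();

            env.execute("region-counts");
        }
    }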
  11. Druid - Columnar database
      • Scalable in-memory columnar database
      • Support for geospatial data
      • Extensible
      • Native integration with Superset
      • Real-time ingestion
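For a sense of what querying Druid looks like from a client, here is a minimal native timeseries query POSTed to a broker; the datasource name, interval, and broker address are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class DruidTimeseriesQuery {
        public static void main(String[] args) throws Exception {
            // Native timeseries query: ride counts per minute over one hour.
            // "ride_events" and the interval are placeholders.
            String query = """
                {
                  "queryType": "timeseries",
                  "dataSource": "ride_events",
                  "granularity": "minute",
                  "intervals": ["2018-08-01T00:00/2018-08-01T01:00"],
                  "aggregations": [{"type": "count", "name": "rides"}]
                }""";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8082/druid/v2/"))  // broker
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(query))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        }
    }

The broker fans such queries out to historical and real-time nodes, which is where the sub-second latency over fresh data comes from.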
  12. Flink Stream SQL
      • Familiarity with SQL
      • Powerful semantics for data manipulation
      • Streaming and batch modes
      • Extensibility via UDFs
      • Joins
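As a sketch of how such a streaming SQL job is expressed (written against a recent Flink Table API; the table name and Kafka connector options are assumptions, not the talk's configuration):

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class RidesPerRegionSql {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                    TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Kafka-backed table of ride events; connector options are
            // placeholders.
            tEnv.executeSql(
                "CREATE TABLE ride_events (" +
                "  region STRING," +
                "  event_time TIMESTAMP(3)," +
                "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'ride_events'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'json')");

            // One-minute tumbling-window ride counts per region.
            tEnv.executeSql(
                "SELECT region," +
                "  TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start," +
                "  COUNT(*) AS rides " +
                "FROM ride_events " +
                "GROUP BY region, TUMBLE(event_time, INTERVAL '1' MINUTE)")
                .print();
        }
    }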
  13. UDFs
      • Geohash
      • Geo region extraction
      • URL cardinality reduction/normalization
        ○ /users/d9cca721a735d/location -> /users/{hash}/location
        ○ /v1//api// -> /v1/api
      • User agent parsing
        ○ OS name/version
        ○ App name/version
      • Sampling
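A Flink scalar UDF for the URL normalization case might look like the following; the regexes are illustrative assumptions that happen to reproduce the two examples above, not the production rules:

    import org.apache.flink.table.functions.ScalarFunction;

    public class NormalizeUrl extends ScalarFunction {
        public String eval(String path) {
            if (path == null) {
                return null;
            }
            return path
                // Collapse duplicate slashes: /v1//api// -> /v1/api/
                .replaceAll("/{2,}", "/")
                // Replace long hex identifiers with a placeholder:
                // /users/d9cca721a735d/location -> /users/{hash}/location
                .replaceAll("/[0-9a-f]{12,}(?=/|$)", "/{hash}")
                // Drop any trailing slash left over.
                .replaceAll("/$", "");
        }
    }

Once registered (e.g. tEnv.createTemporarySystemFunction("normalize_url", NormalizeUrl.class)), it can be called inline in SQL: SELECT normalize_url(path) FROM events.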
  14. Lyft Druid Spec
  15. Flink SQL
  16. Data Flow
  17. Validation of ingestion spec
      • Ingestion spec under source control
      • Protobuf-schema-based compile-time validation
        ‒ SQL
        ‒ Data types
        ‒ Column names
      • Integration tests on sample data
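A rough sketch of what the column-name check against a protobuf schema could look like; RideEvent is a hypothetical generated message class, and the whole validator is an assumption about the approach, not Lyft's actual code:

    import com.google.protobuf.Descriptors;

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class IngestionSpecValidator {
        // Fail fast if the spec names a dimension the event schema lacks.
        static void validateDimensions(List<String> specDimensions,
                                       Descriptors.Descriptor eventSchema) {
            Set<String> schemaFields = eventSchema.getFields().stream()
                    .map(Descriptors.FieldDescriptor::getName)
                    .collect(Collectors.toSet());
            for (String dim : specDimensions) {
                if (!schemaFields.contains(dim)) {
                    throw new IllegalArgumentException(
                            "Dimension '" + dim + "' is not a field of "
                                    + eventSchema.getFullName());
                }
            }
        }

        public static void main(String[] args) {
            // RideEvent is a hypothetical protobuf-generated class.
            validateDimensions(List.of("region", "event_time"),
                    RideEvent.getDescriptor());
        }
    }

Running this kind of check in CI, against specs kept under source control, catches renamed or deleted columns before a bad spec reaches the ingestion pipeline.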
  18. Exploration
  19. Next
  20. Goal - all events in Druid in real time
      • If you log it, you will find it
      • Automagic Druid spec
        ‒ Offline analysis for dimensions/metrics
        ‒ Cardinality analysis
        ‒ Reasonable defaults
      • Auto-provisioning of various resources
        ‒ Kafka topic for a new event
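One way the cardinality analysis could drive a generated spec, sketched under heavy assumptions (the threshold and the metric/dimension rules are invented for illustration):

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class ColumnClassifier {
        // Classify each column of an offline sample: numeric columns become
        // metrics, low-cardinality columns become dimensions.
        static Map<String, String> classify(List<Map<String, Object>> sampleRows,
                                            int dimensionCardinalityLimit) {
            Map<String, Set<Object>> distinct = new HashMap<>();
            for (Map<String, Object> row : sampleRows) {
                row.forEach((column, value) ->
                        distinct.computeIfAbsent(column, k -> new HashSet<>()).add(value));
            }
            Map<String, String> roles = new HashMap<>();
            distinct.forEach((column, values) -> {
                boolean numeric = values.stream().allMatch(v -> v instanceof Number);
                if (numeric) {
                    roles.put(column, "metric");
                } else if (values.size() <= dimensionCardinalityLimit) {
                    roles.put(column, "dimension");
                } else {
                    roles.put(column, "high-cardinality: hash, sample, or drop");
                }
            });
            return roles;
        }
    }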
  21. Conclusion
  22. Conclusion
      • Flink streaming SQL augments Druid’s capabilities
        ‒ Transformation
        ‒ Joins
        ‒ Sampling
      • An easy-to-use ingestion framework is crucial for adoption
  23. Challenges/Next
      • Good retention strategies
        ‒ Data-size-based or time-based?
      • Query rate limiting
      • Flink batch ingestion
      • Anomaly detection
      • Root cause analysis
  24. Thank you!
