6. Fyber - Druid requirements
■ Cube: 80+ dimensions and 20 metrics
■ Performance: query 3 months of data in 6 seconds (3 dimensions)
■ Size: 5 TB of raw data per day to index
8. Data Pipeline
■ Spark streaming from JSON to Parquet on S3
■ Spark batch (on K8s) to clean cardinality, pre-aggregate, and enrich the data
■ Partial data (materialized view)
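The batch pre-aggregation step can be sketched in plain Python (rather than Spark; the event and field names are invented for illustration):

```python
from collections import defaultdict

# Plain-Python sketch of the rollup the Spark batch job performs:
# collapse raw events into one row per unique dimension combination.
raw_events = [
    {"country": "IL", "os": "android", "impressions": 1, "revenue": 0.02},
    {"country": "IL", "os": "android", "impressions": 1, "revenue": 0.03},
    {"country": "US", "os": "ios",     "impressions": 1, "revenue": 0.05},
]

def pre_aggregate(events, dims=("country", "os"), metrics=("impressions", "revenue")):
    """Group events by their dimension values and sum the metrics."""
    rollup = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for event in events:
        key = tuple(event[d] for d in dims)
        for m in metrics:
            rollup[key][m] += event[m]
    return dict(rollup)

print(pre_aggregate(raw_events))  # 3 raw rows collapse to 2 aggregated rows
```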
9. Hour→Day→Week→...
Motivation
The fewer segments you have, the fewer cores are used per query (one core per segment) → you can serve more concurrent users.
BUT if one core reads 700 MB of data while the other cores sit idle, that is also bad design → you need to find the right tuning.
Partitioning: data/segments should be split evenly (beware the long tail...)
By doing aggregation of aggregations we minimize data size and reduce the number of segments:
■ 1 hour: 10 segments of 200 MB
■ 1 day: 100 segments of 220 MB (~50% data reduction compared to 240 × 220 MB)
■ We have 900 cores (30 nodes, 32 cores each), so reading 9,000 segments is problematic
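The segment arithmetic above can be checked with a quick back-of-the-envelope calculation (the segment counts and sizes are the figures from the slides):

```python
# Back-of-the-envelope: rolling up hourly segments into daily segments.
HOURLY_SEGMENTS_PER_HOUR = 10
HOURLY_SEGMENT_MB = 200
DAILY_SEGMENTS = 100
DAILY_SEGMENT_MB = 220
HOURS_PER_DAY = 24

hourly_total_mb = HOURS_PER_DAY * HOURLY_SEGMENTS_PER_HOUR * HOURLY_SEGMENT_MB
daily_total_mb = DAILY_SEGMENTS * DAILY_SEGMENT_MB
reduction = 1 - daily_total_mb / hourly_total_mb

print(f"hourly: {HOURS_PER_DAY * HOURLY_SEGMENTS_PER_HOUR} segments, {hourly_total_mb} MB")
print(f"daily:  {DAILY_SEGMENTS} segments, {daily_total_mb} MB")
print(f"data reduction: {reduction:.0%}")  # roughly the ~50% quoted on the slide

# With one core per segment, a week of hourly data (7 * 240 = 1,680 segments)
# would need more than the 900-core cluster; daily rollup needs only 7 * 100 = 700.
```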
10. “Materialized Views”
Motivation
● Several small cubes in which the dimensions are correlated
○ Row correlation: assume the dimension is country (220 rows); the impact of
■ adding gender is 440 rows
■ adding the country phone prefix (+972 for Israel) adds no new rows
○ Business correlation, e.g. a device-detail cube (OS / Carrier)
● One large cube with all dimensions, used via filter queries rather than topN queries
● Use a cardinality (byRow) aggregator with a timeseries query to measure dimension correlation
● We modified the UI to handle the cube logic by querying the smallest cube that answers the user's dimensions
● Our rule of thumb: ~10M rows per small daily cube (most queries are on daily cubes)
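The "query the smallest cube that answers the user's dimensions" routing rule can be sketched as follows (the cube names, dimension sets, and row counts are invented for illustration):

```python
# Route a query to the smallest cube whose dimensions cover the request.
# Cube catalog is illustrative, not Fyber's actual cube list.
CUBES = {
    "device_detail": {"dims": {"os", "carrier", "country"}, "daily_rows": 2_000_000},
    "geo_gender":    {"dims": {"country", "gender"}, "daily_rows": 440},
    "full":          {"dims": {"os", "carrier", "country", "gender", "app"},
                      "daily_rows": 80_000_000},
}

def pick_cube(requested_dims):
    """Return the name of the smallest cube containing all requested dimensions."""
    candidates = [
        (meta["daily_rows"], name)
        for name, meta in CUBES.items()
        if requested_dims <= meta["dims"]
    ]
    if not candidates:
        raise ValueError(f"no cube covers {requested_dims}")
    return min(candidates)[1]

print(pick_cube({"country", "gender"}))  # smallest covering cube: geo_gender
print(pick_cube({"os", "app"}))          # only the large cube covers it: full
```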
11. Materialized Views
Cube sync: users can see misaligned data when querying the last day, so we need to manage Druid state (in MySQL)
12. Airflow
■ Scheduler
■ Recover from failure
■ UI
■ Each task monitors itself and auto-fixes if needed, including sending atomic alerts per DAG (since Airflow 1.10)
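The self-monitoring-with-auto-fix pattern can be sketched in plain Python (this is not Airflow's API; `run_with_autofix`, `fix`, and `alert` are hypothetical names):

```python
# Sketch: run a task, attempt an auto-fix between retries, and alert
# only once (atomically) if it still fails after the last attempt.
def run_with_autofix(task, fix=None, alert=print, retries=2):
    """Run `task`; on failure apply `fix` and retry, alerting on final failure."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            if fix is not None:
                fix()  # attempt to repair the environment before retrying
            if attempt == retries:
                alert(f"task failed after {retries + 1} attempts: {exc}")
                raise
```

For example, a task that fails once and then succeeds returns normally after one retry, and no alert is sent.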
13. We collect Druid client usage metrics such as:
● average query time
● query time range (last 2 days, or last 3 months)
● popular dimensions
This lets us check whether we need to re-tune the number of segments or the cube separation.
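Aggregating that client usage might look like the following sketch (the log records and field names are invented):

```python
from collections import Counter
from statistics import mean

# Sketch of the client-usage aggregation; records are illustrative.
query_log = [
    {"duration_ms": 1200, "range": "last_2_days", "dims": ["country", "os"]},
    {"duration_ms": 5400, "range": "last_3_months", "dims": ["country"]},
    {"duration_ms": 900,  "range": "last_2_days", "dims": ["os", "carrier"]},
]

avg_ms = mean(q["duration_ms"] for q in query_log)
ranges = Counter(q["range"] for q in query_log)
popular_dims = Counter(d for q in query_log for d in q["dims"])

print(f"average query time: {avg_ms:.0f} ms")
print("time ranges:", ranges.most_common())
print("popular dimensions:", popular_dims.most_common(2))
```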
Should we move to Druid's new native ingestion instead of EMR?
Should we move to Druid's new materialized views?
We added anomaly detection on top of Druid (based on https://github.com/yahoo/egads)
Day after deployment