5. DataPrime Query Engine
■ Custom distributed query engine for a proprietary query language (DataPrime) on arbitrary semi-structured data
■ Querying data stored in object storage
■ Storage format is specialized parquet files
6. Metastore: Motivation
■ Reading parquet metadata from object storage is too expensive for large queries
■ Move metadata into separate (faster) storage
■ Block listing with bloom filters
■ Transactional commit log
7. Metastore: Requirements
Requirements for the metastore:
■ Low latency
■ Scalable
■ Transactional guarantees
Initial implementation: Postgres
Example: one (large) customer:
■ 2k parquet files / hour
■ 50k parquet files / day
■ 15 TB of data / day
■ 20 GB of metadata / day
8. Blocks
Important for listing: give me all the blocks for a table in a given time range.
Example:
■ Block URL:
s3://cgx-production-c4c-archive-data/cx/parquet/v1/team_id=555585/…
…dt=2022-12-02/hr=10/0246f9e9-f0da-4723-9b64-a12346095d25.parquet
■ Row group: 0, 1, 2 …
■ Min timestamp
■ Max timestamp
■ Number of rows
■ Total size
■ …
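As a rough illustration of the shape of one block entry, a minimal Rust sketch (field names here are assumptions for illustration, not the production schema):

// Illustrative shape of one block-metadata entry (field names assumed)
struct BlockMeta {
    block_url: String,      // s3://…/….parquet
    row_group: u32,         // 0, 1, 2, …
    min_timestamp: i64,
    max_timestamp: i64,
    num_rows: u64,
    total_size_bytes: u64,
}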
9. Bloom Filters
Used for pruning blocks when filtering by search term:
■ Is a given token maybe in this block, or definitely not?
Works by hashing each token multiple times and setting the corresponding bits to 1. When checking, hash the token again and verify that all of those bits are 1.
Specifically using a blocked bloom filter (a sequence of small bloom filters):
8192 blocks * 32 bytes = ~262 kB per filter
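A minimal Rust sketch of that check, assuming a simple std hasher and an illustrative probe count (the real hashing scheme is not given in the talk): hash the token to one 32-byte block, then verify every probe bit inside that block is set.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BLOCK_BYTES: usize = 32;   // one bloom filter block = 256 bits
const NUM_BLOCKS: usize = 8192;  // 8192 * 32 bytes = ~262 kB per filter
const NUM_PROBES: u64 = 7;       // illustrative assumption, not the real probe count

fn hash(token: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    token.hash(&mut h);
    h.finish()
}

// True if the token is *maybe* in the filter, false if it is definitely not.
fn maybe_contains(filter: &[u8], token: &str) -> bool {
    // A blocked bloom filter first hashes the token to one 32-byte block…
    let block_idx = hash(token, 0) as usize % NUM_BLOCKS;
    let block = &filter[block_idx * BLOCK_BYTES..(block_idx + 1) * BLOCK_BYTES];
    // …then checks that every probe bit inside that block is set to 1.
    (1..=NUM_PROBES).all(|seed| {
        let bit = hash(token, seed) as usize % (BLOCK_BYTES * 8);
        block[bit / 8] & (1u8 << (bit % 8)) != 0
    })
}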
10. Column Metadata
Per-column parquet metadata required for scanning and decoding a parquet file.
Example:
■ Block URL
■ Row Group
■ Column Name
■ Column metadata (blob)
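A hypothetical CQL layout for this table, shown here as a Rust constant (table and column names are assumptions for illustration, not the production schema): keying by (block, row group) keeps every column entry needed to decode one row group in a single partition.

// Hypothetical DDL; names are illustrative, not the real schema.
const CREATE_COLUMN_METADATA: &str = r#"
CREATE TABLE column_metadata (
    block_url   text,
    row_group   int,
    column_name text,
    metadata    blob,
    PRIMARY KEY ((block_url, row_group), column_name)
)"#;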
13. Bloom Filters
Problem: how do we verify that the bits are set?
Solution: read the bloom filters and check them in the application
Problem: ~50k blocks/day * 262 kB = ~12 GB of data, too much for one query
Solution:
■ chunk the bloom filters and split them into rows
■ by chunking per bloom filter block we read one 32-byte row per block per token:
50k blocks * 32 bytes = ~1.6 MB per token
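A minimal sketch of that chunking, assuming the filter is split along its 32-byte bloom filter blocks (chunk boundaries and naming are illustrative):

const BLOCK_BYTES: usize = 32;

// Split one ~262 kB blocked bloom filter into 8192 (chunk_index, 32-byte chunk)
// rows, so a query only reads the single chunk a token hashes to.
fn chunk_filter(filter: &[u8]) -> impl Iterator<Item = (usize, &[u8])> + '_ {
    filter.chunks(BLOCK_BYTES).enumerate()
}

fn main() {
    let filter = vec![0u8; 8192 * BLOCK_BYTES];
    // One row per block per token: a one-day query over ~50k blocks reads
    // 50_000 * 32 bytes ≈ 1.6 MB per token instead of ~12 GB of whole filters.
    assert_eq!(chunk_filter(&filter).count(), 8192);
}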
14. Bloom Filters: Primary Key (1)
Primary key: ((block_url, row_group), chunk index)
~8192 chunks of 32 bytes per bloom filter = ~262 kB per partition
Pros:
■ Easy to insert and delete, single batch query
Cons:
■ Need to know the block id before reading
■ A lot of partitions to access, 1 day: 50k partitions
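A hypothetical CQL rendering of layout (1), as a Rust constant (names are assumptions for illustration): the whole filter for one (block, row group) lives in a single partition, clustered by chunk index.

// Hypothetical DDL for primary key layout (1); names are illustrative.
const BLOOM_FILTERS_BY_BLOCK: &str = r#"
CREATE TABLE bloom_filters (
    block_url   text,
    row_group   int,
    chunk_index int,
    chunk       blob,  -- one 32-byte bloom filter block
    PRIMARY KEY ((block_url, row_group), chunk_index)
)"#;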
15. Bloom Filters: Primary Key (2)
Primary key: ((table url, hour, chunk index), block url, row group)
~2000 blocks per hour * one 32-byte chunk each = ~64 kB per partition
Pros:
■ Very fast listing, far fewer partitions.
1 day, 5 tokens: 24 * 5 = 120 partitions
■ No dependency on the block, can read in parallel
Cons:
■ Expensive to insert and delete: 8192 partitions for a single block!
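And the same table under layout (2), again as a hypothetical sketch: one partition per (table, hour, chunk index), with one 32-byte chunk per block clustered inside it.

// Hypothetical DDL for primary key layout (2); names are illustrative.
const BLOOM_FILTERS_BY_TABLE_HOUR: &str = r#"
CREATE TABLE bloom_filters (
    table_url   text,
    hour        timestamp,
    chunk_index int,
    block_url   text,
    row_group   int,
    chunk       blob,  -- one 32-byte bloom filter block
    PRIMARY KEY ((table_url, hour, chunk_index), block_url, row_group)
)"#;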
16. Bloom Filters: Future Approach
Investigate optimal chunking:
find a middle ground between writing large enough chunks and reading unnecessary data
Can we use UDFs with WebAssembly? (a sketch of the matching logic follows below)
SELECT block_url, row_group
FROM bloom_filters
WHERE table_url = ? AND hour = ? AND bloom_filter_matches(bloom_filter, indexes)
■ Let ScyllaDB do the hard work
■ Don’t need to worry about the amount of data we’re sending back to the app
■ Code is already written in Rust
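The matching logic such a UDF would need is small. A pure-Rust sketch of only the core check (the actual WebAssembly UDF ABI and the bloom_filter_matches signature are not given in the talk, so this is an assumption): given one stored chunk and a token’s precomputed bit indexes, verify every bit is set.

// Core of a hypothetical bloom_filter_matches UDF: are all of the token’s
// bit positions set in this 32-byte chunk? (WASM/UDF plumbing omitted;
// assumes every index is below chunk.len() * 8.)
fn bloom_filter_matches(chunk: &[u8], bit_indexes: &[usize]) -> bool {
    bit_indexes
        .iter()
        .all(|&bit| chunk[bit / 8] & (1u8 << (bit % 8)) != 0)
}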
17. Be Careful
It’s very much not SQL: try to avoid migrations (and the bugs they bring)
Solutions:
■ Rename columns?
■ Add new columns, UPDATE blocks SET query?
■ Truncate table and start over again
18. ScyllaDB: Ecosystem
Extensive usage of ScyllaDB libraries and components:
■ Written in Rust on top of the ScyllaDB Rust driver
■ ScyllaDB Operator for k8s
■ ScyllaDB Monitoring
■ ScyllaDB Manager
From knowing ScyllaDB exists to production-ready with terabytes of data in 2 months
19. Hardware
Cost is very important
3-node cluster:
■ 8 vCPU
■ 32 GiB memory
■ ARM/Graviton
■ EBS volumes (gp3)
■ 500 MBps bandwidth
■ 12k IOPS
20. Metastore: Block Listing
Largest cluster: 4-5 TB on each node, mostly for one customer
Writes:
■ p99 latency: <1ms
■ ~10k writes / s
Block listing:
■ Depends on query & whether we’re using bloom filters
■ for 1 hour: <20ms latency
■ for 1 day: <500ms latency
21. Metastore: Column Metadata
Reads:
■ p50 latency: 5ms
■ p99 latency: 100ms (at which point we time out)
Issue: a large number of concurrent queries
Probably a disk issue
22. Conclusion
■ Keep an eye on partition sizes
■ Think about read/write patterns
■ Very happy with block listing…
… but unpredictable tail latency for reading column metadata
■ Probably shouldn’t use EBS :-)
23. Thank You
Stay in Touch
Dan Harris
dan@coralogix.com
@thinkharderdev
github.com/thinkharderdev
www.linkedin.com/in/dsh2va
Sebastian Vercruysse
sebastian.vercruysse@coralogix.com
github.com/sebver
www.linkedin.com/in/sebastian-vercruysse