
Making KVS 10x Scalable

Presented at PLAZMA TD Tech Talk SHIBUYA on 2018-10-16.


  1. 1. Making KVS 10x Scalable. Sadayuki Furuhashi, Senior Principal Engineer, @frsyuki. PLAZMA TD Tech Talk 2018 at Shibuya. (How we optimized the real-time delivery server by 10x.)
  2. 2. About Sadayuki Furuhashi: a founder of Treasure Data, Inc., located in Silicon Valley, USA. OSS hacker. GitHub: @frsyuki. Initial designer of several OSS projects.
  3. 3. What's CDP KVS
  4. 4. What's CDP KVS? ✓ Streaming data collection ✓ Bulk data collection ✓ On-demand data delivery: CDP KVS (today's topic)
  5. 5. What's CDP KVS? (Diagram: source data tables, data collection in many ways, preprocess workflows, segmentation workflows, audience data set (customer behaviors), segment data sets, CDP KVS; US, JP, (EU); mobile, PC, devices.)
  6. 6. JavaScript & Mobile Personalization API
  7. 7. REST API call (using the JavaScript SDK) and the returned value.
  8. 8. Architecture (Old): CDP KVS servers in AWS JP and AWS US. DynamoDB + DAX (DAX: DynamoDB's write-through cache) and Ignite (a distributed cache). Presto issues bulk writes; browsers / mobile clients do random lookups.
  9. 9. Challenges to solve
  10. 10. DynamoDB's auto-scaling doesn't scale in time: load spikes right after noon cause request failures!
  11. 11. Expensive Write Capacity cost: already too expensive, and a bigger margin = even more expensive. Request failures!
  12. 12. Workload analysis. Read: random lookup by ID; high temporal locality (repeating visitors); moderate spatial locality (hot & cold data sets). Write: bulk write & append, no delete; low temporal locality (daily or hourly batch); high spatial locality (data sets are rewritten by batch). Number of records: 300 billion. Size of a record: 10 bytes. Size of total records: 3 TB. Read traffic: 50 requests/sec. (Sketch below.)
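
A one-line check of the sizing figures above, as a minimal Python sketch (all numbers come from the slide):

```python
# 300 billion records x 10 bytes each is about 3 TB of raw data.
records = 300_000_000_000
record_size_bytes = 10
print(records * record_size_bytes / 1e12, "TB")   # 3.0 TB
```
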
  13. 13. Ideas (A) Alternative distributed KVS (Aerospike) (B) Storage Hierarchy on KVS (C) Edit log shipping & Indexed archive
  14. 14. Idea (A) Alternative Distributed KVS
  15. 15. (A) Alternative Distributed KVS: the CDP KVS servers and Presto switch from DynamoDB + DAX and Ignite to a cluster of Aerospike nodes.
  16. 16. Aerospike: Pros & Cons • Good: very fast lookup • In-memory index + direct IO on SSDs • Bad: expensive (hardware & operation) • Same cost for both cold & hot data (large memory overhead for cold data) • No spatial locality for writes (a batch-write becomes random writes)
  17. 17. Aerospike: Storage Architecture. Each Aerospike node keeps a hash index in DRAM (hash(k01): addr 01, size=3; hash(k02): addr 09, size=3; hash(k03): addr 76, size=3) pointing at records on SSD (/dev/sdb: addr 01: k01 = v01, addr 09: k02 = v02, addr 76: k03 = v03); a GET resolves hash(k01) in DRAM and reads that address. ✓ Primary keys (hashes) are always in memory => always fast lookup. ✓ Data is always on SSD => always durable. ✓ IO on SSD is direct IO (no filesystem cache) => consistently fast without warm-up. The index is loaded at startup (cold-start). (Sketch below.)
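
A minimal, illustrative sketch of this layout in Python (not Aerospike's actual code): an in-DRAM dict maps key hashes to (offset, size) on the device, so a GET is one in-memory lookup plus one positional read. Direct IO and the real on-disk record format are omitted.

```python
import os
import tempfile

class HashIndexStore:
    """Aerospike-style layout: index in DRAM, values on the block device."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)
        self.index = {}                          # hash(key) -> (offset, size), kept in DRAM

    def load_index(self, entries):
        # "Load index at startup (cold-start)": rebuild the in-memory index.
        for key, offset, size in entries:
            self.index[hash(key)] = (offset, size)

    def get(self, key):
        offset, size = self.index[hash(key)]     # always an in-memory lookup
        return os.pread(self.fd, size, offset)   # one read from the device per GET

# Demo with a temporary file standing in for /dev/sdb.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"v01v02v03")
store = HashIndexStore(f.name)
store.load_index([("k01", 0, 3), ("k02", 3, 3), ("k03", 6, 3)])
print(store.get("k02"))                          # b'v02'
```
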
  18. 18. Aerospike: System Architecture. hash(key) = node ID, so the records of one bulk write ({k01..k05} or {k06..k0a}) scatter across the Aerospike nodes. Batch write => random writes: no locality, no compression, more overhead. (Note: compressing 10-byte records individually isn't efficient.) (Sketch below.)
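
A small sketch of why hash-based placement turns a batch write into per-node random writes; the node count and keys here are made up for illustration.

```python
import hashlib

NODES = 6   # hypothetical cluster size

def node_for(key: str) -> int:
    # hash(key) = node ID (stable hash so the example is deterministic)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

bulk_write = {"k01": "v01", "k02": "v02", "k03": "v03", "k04": "v04", "k05": "v05"}

by_node = {}
for key, value in bulk_write.items():
    by_node.setdefault(node_for(key), {})[key] = value

# One small, uncompressed write per node instead of one sequential, compressible batch.
print(by_node)
```
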
  19. 19. Aerospike: Cost estimation • 1 record needs 64 bytes of DRAM for primary-key indexing • Storing 100 billion records (our use case) needs 6.4 TB of DRAM • With replication-factor=3, our system needs 19.2 TB of DRAM • That is r5.24xlarge × 26 instances on EC2, which costs $89,000/month (1-year reserved, convertible) • Cost structure: very high DRAM cost per GB, moderate IOPS cost, low storage & CPU cost, high operational cost. (Sketch below.)
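
A back-of-the-envelope check of that sizing in Python. The per-record index size, record count, and replication factor come from the slide; 768 GiB is the published memory of an r5.24xlarge, and the dollar figure is quoted from the slide rather than recomputed.

```python
import math

records = 100_000_000_000            # 100 billion records (slide)
index_bytes_per_record = 64          # DRAM per primary-key index entry (slide)
replication_factor = 3               # replication-factor=3 (slide)

dram_bytes = records * index_bytes_per_record * replication_factor
print(dram_bytes / 1e12, "TB of DRAM")                 # 19.2 TB

r5_24xlarge_gib = 768                # memory of one r5.24xlarge
instances = math.ceil(dram_bytes / (r5_24xlarge_gib * 2**30))
print(instances, "instances minimum")                  # 24 from DRAM alone; the slide budgets 26
```
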
  20. 20. Idea (B) Storage Hierarchy on KVS
  21. 21. Analyzing a cause of expensive DynamoDB WCU: write capacity is consumed in 1 KB units, rounded up per item. Writing a 3.2 KB item (PK + Col1 + Col2) consumes 4 Write Capacity units, so 0.8 WCU is wasted. (Sketch below.)
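
A minimal sketch of that accounting (standard DynamoDB write pricing: one write capacity unit per 1 KB written, rounded up per item):

```python
import math

def write_capacity_units(item_size_bytes: int) -> int:
    # One WCU per 1 KB written, rounded up per item, minimum 1.
    return max(1, math.ceil(item_size_bytes / 1024))

print(write_capacity_units(int(3.2 * 1024)))   # 3.2 KB item -> 4 WCU (0.8 WCU wasted)
print(write_capacity_units(10))                # 10-byte record -> still 1 WCU
```
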
  22. 22. DynamoDB with record size <<< 1 KB: each 10-byte record (Key1: Val1 ... Key4: Val4) still consumes 1 Write Capacity unit, so 4 Write Capacity units are consumed to store 40 bytes. 99% of the WCU is wasted!
  23. 23. Solution: optimizing DynamoDB WCU overhead. Instead of four 10-byte items (4 Write Capacity units), pack the records into one item keyed by a partition ID: {Key1: Val1, Key2: Val2, Key3: Val3, Key4: Val4} ≈ 30 bytes => 1 Write Capacity unit. (Note: expected 5x - 10x compression ratio.) (Sketch below.)
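
A small sketch of the saving from packing many tiny records into one item keyed by a partition id; JSON stands in for the real encoding and the item-size accounting is simplified.

```python
import json
import math

def wcu(item_size_bytes: int) -> int:
    return max(1, math.ceil(item_size_bytes / 1024))

records = {f"Key{i}": f"Val{i}" for i in range(1, 5)}                    # four ~10-byte records

unpacked_cost = sum(wcu(len(k) + len(v)) for k, v in records.items())    # 4 WCU for ~40 bytes
packed_item = json.dumps(records).encode()                               # one item per partition id
packed_cost = wcu(len(packed_item))                                      # 1 WCU

print(unpacked_cost, "WCU unpacked vs", packed_cost, "WCU packed")
```
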
  24. 24. (B) Storage Hierarchy on KVS: each bulk-write data set ({k01, k03, k06, k08, k0a} and {k02, k04, k05, k07, k09}) is compressed and written to DynamoDB + DAX, with hash(partition id) as the primary key.
  25. 25. Storage Hierarchy on KVS: Pros & Cons • Good: very scalable write & storage cost • Data compression (10x less write & storage cost) • Far fewer primary keys (1 / 100,000 with 100k records in a partition) • Bad: complex to understand & use • More difficult to understand • The writer (Presto) must partition data by partition id
  26. 26. Data partitioning - write. hash(key) = partition id | split id. The original data set {k01: v01 ... k0a: v0a} is grouped into partitions (partition id=71 holds k01, k03, k06, k08, k0a; partition id=69 holds k02, k04, k05, k07, k09), each partition is encoded & compressed into splits (Split 1, Split 2, Split 3), and DynamoDB stores one row per partition with columns PK, Split 1, Split 2, Split 3 (rows 71 and 69). Partitioning is done with Presto (a GROUP BY + array_agg query). (Sketch below.)
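
A minimal write-side sketch in Python, assuming hash(key) yields both the partition id and the split id; the partition and split counts are made up, and in production this grouping is the Presto GROUP BY + array_agg query named on the slide, not application code.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 100   # hypothetical
NUM_SPLITS = 3         # hypothetical

def locate(key: str):
    # hash(key) = partition id | split id
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NUM_PARTITIONS, (h // NUM_PARTITIONS) % NUM_SPLITS

def partition(records: dict):
    rows = defaultdict(lambda: defaultdict(dict))      # partition id -> split id -> {key: value}
    for key, value in records.items():
        pid, sid = locate(key)
        rows[pid][sid][key] = value
    # Each rows[pid] becomes one DynamoDB item; each split is then encoded & compressed.
    return rows

print(partition({"k01": "v01", "k02": "v02", "k03": "v03"}))
```
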
  27. 27. Data partitioning - read. GET k06: hash(k06) gives partition id=71, split id=2. Fetch the row PK=71 from DynamoDB through DAX (cache), take the encoded split 2 (k03: v03, k06: v06), and scan it for k06 => {k06: v06}. (Sketch below.)
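
A matching read-path sketch. A plain dict stands in for DynamoDB + DAX, and json + zlib stand in for the real MessagePack + Zstd split encoding (see the appendix slide); `locate` is the same hypothetical hash as in the write sketch.

```python
import hashlib
import json
import zlib

NUM_PARTITIONS = 100
NUM_SPLITS = 3

def locate(key: str):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NUM_PARTITIONS, (h // NUM_PARTITIONS) % NUM_SPLITS

TABLE = {}   # partition id -> {split id: compressed split}; stands in for DynamoDB + DAX

def put_partition(pid: int, splits: dict):
    TABLE[pid] = {sid: zlib.compress(json.dumps(recs).encode()) for sid, recs in splits.items()}

def kvs_get(key: str):
    pid, sid = locate(key)                               # hash(key) = partition id | split id
    row = TABLE[pid]                                     # GetItem, served from DAX when hot
    records = json.loads(zlib.decompress(row[sid]))      # decode only the needed split
    return records.get(key)                              # scan that split for the key

pid, sid = locate("k06")
put_partition(pid, {sid: {"k06": "v06"}})
print(kvs_get("k06"))                                    # v06
```
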
  28. 28. Idea (C) Edit log shipping & Indexed Archive
  29. 29. (C) Edit log shipping & Indexed Archive. Writer API nodes receive bulk writes and publish a stream of bulk-write data sets to Kafka / Kinesis (+ S3 for backup & cold-start). Indexing & storage nodes subscribe to the stream and index it into RocksDB; each node holds two shards (shards 0,1 / 1,2 / 2,3 / 3,0). Reader API nodes serve reads from the storage nodes. etcd / consul handle shard & node list management. (Sketch below.)
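
A minimal sketch of the subscribe-and-index path in idea (C): in-memory dicts stand in for RocksDB, a plain list stands in for the Kafka / Kinesis stream, and the node-to-shard assignment mirrors the slide (each storage node holds two shards).

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4
NODE_SHARDS = {0: {0, 1}, 1: {1, 2}, 2: {2, 3}, 3: {3, 0}}   # node id -> shards it indexes

def shard_of(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

class StorageNode:
    def __init__(self, node_id: int):
        self.shards = NODE_SHARDS[node_id]
        self.db = defaultdict(dict)          # shard id -> {key: value}; RocksDB in reality

    def apply(self, bulk_write: dict):
        # Subscribe side: index only the records that belong to this node's shards.
        for key, value in bulk_write.items():
            s = shard_of(key)
            if s in self.shards:
                self.db[s][key] = value

stream = [{"k01": "v01", "k02": "v02"}]      # bulk-write data sets published by the writer API
nodes = [StorageNode(i) for i in NODE_SHARDS]
for edit in stream:
    for node in nodes:
        node.apply(edit)
print({i: dict(n.db) for i, n in enumerate(nodes)})
```
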
  30. 30. Architecture of RocksDB (figure from "Optimization of RocksDB for Redis on Flash", Keren Ouaknine, Oran Agra, and Zvika Guz).
  31. 31. Pros & Cons • Good: very scalable write & storage cost • Data compression (10x less write & storage cost) • Bad: expensive to implement & operate • Implementing 3 custom server components (stateless: Writer, Reader; stateful: Storage) • Operating stateful servers means more work to implement backup, restore, monitoring, alerting, etc. • Others: • Flexible indexing • Eventually consistent
  32. 32. Our decision: Storage Hierarchy on DynamoDB • Operating stateful servers is harder than you think! • Note: almost all Treasure Data components are stateless (or a cache or a temporary buffer) • Even if the data format becomes complicated, stateless servers on DynamoDB are a better option for us.
  33. 33. Appendix: Split format. Each split stored in a DynamoDB column (e.g. Split 2 of PK=71, holding {k03: v03, k06: v06, ...}) is a hash table serialized with MessagePack and compressed with Zstd: every bucket is encoded as msgpack([[keyLen 1, keyLen 2, keyLen 3, ...], "key1key2key3...", [valLen 1, valLen 2, valLen 3, ...], "val1val2val3..."]), and the split is zstd(msgpack([bucket 1, bucket 2, ..., bucket N])). Size of a split: approx. 200 KB (100,000 records). MessagePack is nested so that a lookup deserializes only the bucket it needs, omitting unnecessary deserialization. (Sketch below.)
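
A minimal encode/lookup sketch of this split format using the msgpack and zstandard Python packages; the bucket count and the md5-based bucket choice are assumptions for illustration.

```python
import hashlib
import msgpack      # MessagePack serialization
import zstandard    # Zstd compression

NUM_BUCKETS = 16    # hypothetical

def bucket_of(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS

def encode_split(records: dict) -> bytes:
    buckets = [{} for _ in range(NUM_BUCKETS)]
    for k, v in records.items():
        buckets[bucket_of(k)][k] = v
    packed_buckets = [
        msgpack.packb([
            [len(k) for k in b], "".join(b),                    # key lengths + concatenated keys
            [len(v) for v in b.values()], "".join(b.values()),  # value lengths + concatenated values
        ])
        for b in buckets
    ]
    # Nested MessagePack: the outer array holds already-serialized buckets.
    return zstandard.ZstdCompressor().compress(msgpack.packb(packed_buckets))

def lookup(split: bytes, key: str):
    buckets = msgpack.unpackb(zstandard.ZstdDecompressor().decompress(split))
    # Deserialize only the one bucket the key hashes to.
    key_lens, keys, val_lens, vals = msgpack.unpackb(buckets[bucket_of(key)])
    kpos = vpos = 0
    for klen, vlen in zip(key_lens, val_lens):
        if keys[kpos:kpos + klen] == key:
            return vals[vpos:vpos + vlen]
        kpos += klen
        vpos += vlen
    return None

split = encode_split({"k03": "v03", "k06": "v06"})
print(lookup(split, "k06"))   # v06
```
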
  34. 34. Results
  35. 35. Bulk write performance: 6x less total time; a single bulk-write (which loops 18 times) is 8x faster.
  36. 36. DynamoDB Write Capacity consumption: 921k Write Capacity consumed in 45 minutes, 170 Write Capacity per second on average (≒ 170 WCU).
  37. 37. What's Next? • Implementation => DONE • Testing => DONE • Deploying => on-going • Designing Future Extensions => FUTURE WORK
  38. 38. A possible future work: an extension for streaming computation (an on-demand read operation updates a value). Read: random lookup by ID; temporal locality high (repeating visitors); spatial locality moderate (hot & cold data sets). Write: becomes random write; temporal locality low => high?; spatial locality high => low?
