Plazma - Treasure Data’s distributed analytical database -

Treasure Data Tech Talk 2015-03-11, 2015-03-12

Published in: Engineering

  1. Sadayuki Furuhashi, Founder & Software Architect, Treasure Data, Inc. Plazma: Treasure Data’s distributed analytical database, growing by 40,000,000,000 records/day.
  2. Plazma - Treasure Data’s distributed analytical database
  3. Plazma by the numbers
     > Data importing
       > 450,000 records/sec ≒ 40 billion records/day
     > Query processing using Hive
       > 2 trillion records/day
       > 2,828 TB/day
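
The import figure above can be checked with a few lines of arithmetic. This is only a sanity check of the rounding on the slide; the variable names are illustrative.

```python
# Sustained import rate quoted on the slide.
RECORDS_PER_SEC = 450_000
SECONDS_PER_DAY = 24 * 60 * 60

records_per_day = RECORDS_PER_SEC * SECONDS_PER_DAY
print(f"{records_per_day:,} records/day")  # 38,880,000,000 records/day
```

The result, about 38.9 billion, is what the slide rounds to 40 billion records/day.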
  4. Today’s talk
     1. Data importing
        > Realtime Storage & Archive Storage
        > Deduplication
     2. Data processing
        > Column-oriented IO
        > Schema-on-read
        > Schema auto detection
     3. Transaction & Metadata
        > Schema auto detection
        > INSERT INTO
  5. 1. Data Importing
  6. Import Queue
     Diagram: td-agent / fluentd, Import Worker, API Server, MySQL (PerfectQueue)
     ✓ Buffering for 1 minute ✓ Retrying (at-least once) ✓ On-disk buffering on failure ✓ Unique ID for each chunk
     It’s like JSON. but fast and small.
     unique_id=375828ce5510cadb
     {"time":1426047906,"uid":1,…}
     {"time":1426047912,"uid":9,…}
     {"time":1426047939,"uid":3,…}
     {"time":1426047951,"uid":2,…}
     …
  7. Import Queue
     Diagram: td-agent / fluentd, Import Worker, API Server, MySQL (PerfectQueue)
     ✓ Buffering for 1 minute ✓ Retrying (at-least once) ✓ On-disk buffering on failure ✓ Unique ID for each chunk
     It’s like JSON. but fast and small.
     | unique_id        | time             |
     | 375828ce5510cadb | 2015-12-01 10:47 |
     | 2024cffb9510cadc | 2015-12-01 11:09 |
     | 1b8d6a600510cadd | 2015-12-01 11:21 |
     | 1f06c0aa510caddb | 2015-12-01 11:38 |
  8. Import Queue
     Diagram: td-agent / fluentd, Import Worker, API Server, MySQL (PerfectQueue)
     ✓ Buffering for 1 minute ✓ Retrying (at-least once) ✓ On-disk buffering on failure ✓ Unique ID for each chunk
     It’s like JSON. but fast and small.
     | unique_id        | time             |
     | 375828ce5510cadb | 2015-12-01 10:47 |
     | 2024cffb9510cadc | 2015-12-01 11:09 |
     | 1b8d6a600510cadd | 2015-12-01 11:21 |
     | 1f06c0aa510caddb | 2015-12-01 11:38 |
     UNIQUE (at-most once)
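
A minimal sketch of the deduplication idea on these slides: the client retries uploads (at-least once) and a UNIQUE constraint on the chunk ID rejects repeats (at-most once). It uses Python's built-in `sqlite3` as a stand-in for the MySQL-backed queue; the table name `import_queue`, the `payload` column, and the function names are assumptions for illustration, not PerfectQueue's actual schema or API.

```python
import sqlite3


def make_queue():
    """Create an in-memory queue table with a UNIQUE chunk ID (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE import_queue ("
        " unique_id TEXT NOT NULL UNIQUE,"
        " payload BLOB NOT NULL)"
    )
    return conn


def enqueue_chunk(conn, unique_id, payload):
    """Return True if the chunk was accepted, False if it is a duplicate upload."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO import_queue (unique_id, payload) VALUES (?, ?)",
                (unique_id, payload),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint turns a retried upload into a no-op.
        return False


conn = make_queue()
first = enqueue_chunk(conn, "375828ce5510cadb", b"chunk-bytes")
retry = enqueue_chunk(conn, "375828ce5510cadb", b"chunk-bytes")  # client retry
queued = conn.execute("SELECT COUNT(*) FROM import_queue").fetchone()[0]
print(first, retry, queued)
```

Because a duplicate is reported rather than raised, the sender can treat "already queued" as success and stop retrying.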
  9. Diagram: Import Queue; Import Worker ×3 ✓ HA ✓ Load balancing
  10. Diagram: Import Queue; Import Worker ×3; Realtime Storage; Archive Storage; Amazon S3 / Basho Riak CS; PostgreSQL (Metadata)
  11. Diagram: Import Queue; Import Worker ×3; Realtime Storage; Archive Storage; Amazon S3 / Basho Riak CS; PostgreSQL (Metadata)
      | uploaded time    | file | index range                                | records |
      | 2015-03-08 10:47 |      | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3       |
      | 2015-03-08 11:09 |      | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25      |
      | 2015-03-08 11:38 |      | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14      |
      | …                | …    | …                                          | …       |
      Metadata of the records in a file (stored on PostgreSQL)
  12. Diagram: Merge Worker (MapReduce); Realtime Storage; Archive Storage; Amazon S3 / Basho Riak CS; PostgreSQL (Metadata)
      | uploaded time    | file | index range                                | records |
      | 2015-03-08 10:47 |      | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3       |
      | 2015-03-08 11:09 |      | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25      |
      | 2015-03-08 11:38 |      | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14      |
      | …                | …    | …                                          | …       |
      | file | index range                                | records |
      |      | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312   |
      |      | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143   |
      | …    | …                                          | …       |
      Merge every 1 hour
      Retrying + Unique (at-least-once + at-most-once)
  13. Diagram: Realtime Storage; Archive Storage; Amazon S3 / Basho Riak CS; PostgreSQL (Metadata)
      | uploaded time    | file | index range                                | records |
      | 2015-03-08 10:47 |      | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3       |
      | 2015-03-08 11:09 |      | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25      |
      | 2015-03-08 11:38 |      | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14      |
      | …                | …    | …                                          | …       |
      | file | index range                                | records |
      |      | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312   |
      |      | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143   |
      | …    | …                                          | …       |
      GiST (R-tree) index on the “time” column of the files
      Read from Archive Storage if merged. Otherwise, from Realtime Storage
  14. Data Importing
      > Scalable & Reliable importing
        > Fluentd buffers data on a disk
        > Import queue deduplicates uploaded chunks
        > Workers take the chunks and put them into Realtime Storage
      > Instant visibility
        > Imported data is immediately visible to query engines.
        > Background workers merge the files every 1 hour.
      > Metadata
        > Index is built on PostgreSQL using RANGE type and GiST index
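
The hourly background merge summarized above could be sketched as follows: small Realtime Storage files are regrouped into one partition per hour. This is an illustration under stated assumptions, not Plazma's merge worker: records are plain dicts with a `time` field, partitions are keyed by a half-open `[start, end)` hour range, and `hour_range` and `merge_hourly` are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hour_range(ts):
    """Return the half-open [start, end) hour that contains ts."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return (start, start + timedelta(hours=1))


def merge_hourly(realtime_files):
    """Regroup records from many small files into one sorted partition per hour."""
    merged = defaultdict(list)
    for records in realtime_files:
        for record in records:
            merged[hour_range(record["time"])].append(record)
    for records in merged.values():
        records.sort(key=lambda r: r["time"])
    return dict(merged)


# Two small uploaded files, using timestamps from the slides.
files = [
    [{"time": datetime(2015, 12, 1, 10, 47, 11)},
     {"time": datetime(2015, 12, 1, 10, 48, 13)}],
    [{"time": datetime(2015, 12, 1, 11, 9, 32)}],
]
partitions = merge_hourly(files)
for (start, end), records in sorted(partitions.items()):
    print(start, end, len(records))
```

A reader would then consult the merged partition for an hour once it exists and fall back to the realtime files otherwise, as the slides describe.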
  15. 2. Data processing
  16. Archive Storage: files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL
      | time                | code  | method |
      | 2015-12-01 10:02:36 | 200   | GET    |
      | 2015-12-01 10:22:09 | 404   | GET    |
      | 2015-12-01 10:36:45 | 200   | GET    |
      | 2015-12-01 10:49:21 | 200   | POST   |
      | …                   | …     | …      |
      | time                | code  | method |
      | 2015-12-01 11:10:09 | 200   | GET    |
      | 2015-12-01 11:21:45 | 200   | GET    |
      | 2015-12-01 11:38:59 | 200   | GET    |
      | 2015-12-01 11:43:37 | 200   | GET    |
      | 2015-12-01 11:54:52 | “200” | GET    |
      | …                   | …     | …      |
      | path | index range                                | records |
      |      | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312   |
      |      | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143   |
      | …    | …                                          | …       |
      MessagePack Columnar File Format
  17. Archive Storage: files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL
      | time                | code  | method |
      | 2015-12-01 10:02:36 | 200   | GET    |
      | 2015-12-01 10:22:09 | 404   | GET    |
      | 2015-12-01 10:36:45 | 200   | GET    |
      | 2015-12-01 10:49:21 | 200   | POST   |
      | …                   | …     | …      |
      | time                | code  | method |
      | 2015-12-01 11:10:09 | 200   | GET    |
      | 2015-12-01 11:21:45 | 200   | GET    |
      | 2015-12-01 11:38:59 | 200   | GET    |
      | 2015-12-01 11:43:37 | 200   | GET    |
      | 2015-12-01 11:54:52 | “200” | GET    |
      | …                   | …     | …      |
      | path | index range                                | records |
      |      | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312   |
      |      | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143   |
      | …    | …                                          | …       |
      column-based partitioning
      time-based partitioning
  18. Archive Storage: files on Amazon S3 / Basho Riak CS, metadata on PostgreSQL
      | time                | code  | method |
      | 2015-12-01 10:02:36 | 200   | GET    |
      | 2015-12-01 10:22:09 | 404   | GET    |
      | 2015-12-01 10:36:45 | 200   | GET    |
      | 2015-12-01 10:49:21 | 200   | POST   |
      | …                   | …     | …      |
      | time                | code  | method |
      | 2015-12-01 11:10:09 | 200   | GET    |
      | 2015-12-01 11:21:45 | 200   | GET    |
      | 2015-12-01 11:38:59 | 200   | GET    |
      | 2015-12-01 11:43:37 | 200   | GET    |
      | 2015-12-01 11:54:52 | “200” | GET    |
      | …                   | …     | …      |
      | path | index range                                | records |
      |      | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312   |
      |      | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143   |
      | …    | …                                          | …       |
      column-based partitioning
      time-based partitioning
      SELECT code, COUNT(1) FROM logs WHERE time >= '2015-12-01 11:00:00' GROUP BY code
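
To illustrate how the two kinds of partitioning help the query above, here is a small in-memory sketch. Time-based partitioning lets the reader skip whole files whose index range ends before the cutoff, and column-based partitioning lets it touch only the `time` and `code` columns. The dict layout, the half-open ranges, and `count_by_code` are assumptions made for the example; they are not Plazma's file format or reader.

```python
from collections import Counter
from datetime import datetime

ts = datetime.fromisoformat  # shorthand for building timestamps

# Each partition: an index range plus one list per column (rows from the slides).
partitions = [
    {"range": (ts("2015-12-01 10:00:00"), ts("2015-12-01 11:00:00")),
     "columns": {
         "time": [ts("2015-12-01 10:02:36"), ts("2015-12-01 10:22:09"),
                  ts("2015-12-01 10:36:45"), ts("2015-12-01 10:49:21")],
         "code": [200, 404, 200, 200],
         "method": ["GET", "GET", "GET", "POST"]}},
    {"range": (ts("2015-12-01 11:00:00"), ts("2015-12-01 12:00:00")),
     "columns": {
         "time": [ts("2015-12-01 11:10:09"), ts("2015-12-01 11:21:45"),
                  ts("2015-12-01 11:38:59"), ts("2015-12-01 11:43:37"),
                  ts("2015-12-01 11:54:52")],
         "code": [200, 200, 200, 200, "200"],  # the last value is a string on the slide
         "method": ["GET", "GET", "GET", "GET", "GET"]}},
]


def count_by_code(partitions, since):
    """Count rows per code with time >= since; return (counts, files_read)."""
    counts = Counter()
    files_read = 0
    for part in partitions:
        _start, end = part["range"]
        if end <= since:
            continue  # time-based pruning: the whole file is before the cutoff
        files_read += 1
        # Column projection: only "time" and "code" are read; "method" is never touched.
        for t, code in zip(part["columns"]["time"], part["columns"]["code"]):
            if t >= since:
                counts[code] += 1
    return counts, files_read


counts, files_read = count_by_code(partitions, ts("2015-12-01 11:00:00"))
print(counts, files_read)
```

The string `"200"` is counted separately from the integer `200` here, which is the kind of type mismatch a schema applied at read time has to reconcile.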
  19. Schema-on-Read
      | time                | code | method |
      | 2015-12-01 10:02:36 | 200  | GET    |
      | 2015-12-01 10:22:09 | 404  | GET    |
      | 2015-12-01 10:36:45 | 200  | GET    |
      | 2015-12-01 10:49:21 | 200  | POST   |
      | …                   | …    | …      |
      | user | time                | code  | method |
      | 391  | 2015-12-01 11:10:09 | 200   | GET    |
      | 482  | 2015-12-01 11:21:45 | 200   | GET    |
      | 573  | 2015-12-01 11:38:59 | 200   | GET    |
      | 664  | 2015-12-01 11:43:37 | 200   | GET    |
      | 755  | 2015-12-01 11:54:52 | “200” | GET    |
      | …    | …                   | …     | …      |
      MessagePack Columnar File Format is schema-less ✓ Instant schema change
      SQL is schema-full ✓ SQL doesn’t work without schema
  20. Schema-on-Read
      Diagram: Realtime Storage, Archive Storage (Schema-less); Query Engine: Hive, Pig, Presto (Schema-full); Schema
      {"user":54, "name":"plazma", "value":"120", "host":"local"}
      CREATE TABLE events (user INT, name STRING, value INT, host INT);
      | user | name     | value | host |
      | 54   | "plazma" | 120   | NULL |
  21. Schema-on-Read
      Diagram: Realtime Storage, Archive Storage (Schema-less); Query Engine: Hive, Pig, Presto (Schema-full); Schema
      {"user":54, "name":"plazma", "value":"120", "host":"local"}
      CREATE TABLE events (user INT, name STRING, value INT, host INT);
      | user | name     | value | host |
      | 54   | "plazma" | 120   | NULL |
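
The schema-on-read example above (a schema-less record projected through a table schema, with a non-convertible value becoming NULL) can be sketched in a few lines. The function names and the two supported types are assumptions for illustration; the actual casting rules of Plazma's readers are not given on the slides.

```python
def cast_value(value, type_name):
    """Cast one value to a column type; return None (NULL) if it cannot be cast."""
    if value is None:
        return None
    if type_name not in ("INT", "STRING"):
        raise ValueError(f"unsupported type: {type_name}")
    try:
        return int(value) if type_name == "INT" else str(value)
    except (TypeError, ValueError):
        return None  # e.g. "local" cannot be read as INT


def apply_schema(record, schema):
    """Project a schema-less record onto (column, type) pairs at read time."""
    return {column: cast_value(record.get(column), type_name)
            for column, type_name in schema}


# Record and schema taken from the slide.
record = {"user": 54, "name": "plazma", "value": "120", "host": "local"}
schema = [("user", "INT"), ("name", "STRING"), ("value", "INT"), ("host", "INT")]

row = apply_schema(record, schema)
print(row)  # {'user': 54, 'name': 'plazma', 'value': 120, 'host': None}
```

Since the stored record is never rewritten, changing the schema (for example, declaring `host` as STRING) only changes how later reads interpret the same data.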
  22. 3. Transaction & Metadata
  23. Plazma’s Transaction API
      > getOrCreateMetadataTransaction(uniqueName)
        > start a named transaction.
        > if already started, abort the previous one and restart.
      > putOrOverwriteTransactoinPartition(name)
        > insert a file into the transaction.
        > if the file already exists, overwrite it.
      > commitMetadataTransaction(uniqueName)
        > make the inserted files visible.
        > if the transaction was already committed, do nothing.
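
The semantics listed above make every step safe to retry. The in-memory sketch below mirrors the three calls with hypothetical snake_case methods on a made-up `MetadataStore` class; the extra parameters (transaction name and file data on the put call) and the behavior when a committed transaction is started again are assumptions, since the slide does not specify them.

```python
class MetadataStore:
    """Toy model of a retry-safe metadata transaction (not Plazma's implementation)."""

    def __init__(self):
        self.visible = {}       # partition name -> data visible to queries
        self.pending = {}       # transaction name -> staged partitions
        self.committed = set()  # names of committed transactions

    def get_or_create_metadata_transaction(self, unique_name):
        # Starting again aborts the previous attempt by discarding staged files.
        self.pending[unique_name] = {}

    def put_or_overwrite_transaction_partition(self, unique_name, partition, data):
        # Writing the same partition twice keeps only the latest copy.
        self.pending[unique_name][partition] = data

    def commit_metadata_transaction(self, unique_name):
        staged = self.pending.pop(unique_name, {})
        if unique_name in self.committed:
            return False  # already committed: do nothing (assumed: staged files are dropped)
        self.visible.update(staged)
        self.committed.add(unique_name)
        return True


def run_insert(store, txn_name, files):
    """One attempt of an INSERT INTO-style job; safe to call again on retry."""
    store.get_or_create_metadata_transaction(txn_name)
    for partition, data in files.items():
        store.put_or_overwrite_transaction_partition(txn_name, partition, data)
    return store.commit_metadata_transaction(txn_name)


store = MetadataStore()
files = {"part-0": b"a", "part-1": b"b"}
first = run_insert(store, "job-1", files)
retry = run_insert(store, "job-1", files)  # the whole job is retried
print(first, retry, sorted(store.visible))
```

Running the job twice leaves exactly one copy of each file visible, which is the "at-least-once + at-most-once" combination the later slides refer to.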
  24. Example: INSERT INTO impl. to Presto
      Diagram: Presto coordinator; Presto worker ×2; Plazma (Metadata, Archive Storage)
      1. getOrCreateMetadataTransaction
      2. putOrOverwriteTransactoinPartition(name)
      3. commitMetadataTransaction
      Retrying + Unique (at-least-once + at-most-once)
  25. Example: INSERT INTO impl. to Hive
      Diagram: Hive QueryRunner; Reducer ×2; Plazma (Metadata, Archive Storage)
      1. getOrCreateMetadataTransaction
      2. putOrOverwriteTransactoinPartition(name)
      3. commitMetadataTransaction
      Retrying + Unique (at-least-once + at-most-once)
  26. Example: INSERT INTO impl. to Hive, rewriting query plan
      Diagram: Hive QueryRunner ×2; Mapper and Reducer stages
      Labels: Rewrite query plan; Partitioning by time; Files are not partitioned by time
  27. Why not MySQL? - benchmark
      [Bar chart in seconds, axis 0 to 180. Series: MySQL, PostgreSQL. Operations: INSERT 50,000 rows; SELECT sum(id); SELECT sum(file_size) WHERE index range. Bar values as extracted: 0.65, 6.57, 8.79, 168, 3.66, 17.2. Annotations: Index-only scan; GiST index + range type.]
  28. Metadata optimization
      > Partitioning & TRUNCATE
        > DELETE produces many garbage rows and large WAL
        > TRUNCATE doesn’t
      > PostgreSQL parameters
        > random_page_cost == seq_page_cost
        > statement_timeout = 60 sec
        > hot_standby_feedback = 1
  29. We’re hiring! (Marunouchi, Tokyo, Japan)
      1. Backend Engineer
      2. Support Engineer
      3. OSS Engineer
