Planet-scale Data Ingestion Pipeline: Bigdam

Treasure Data Tech Talk PLAZMA: TD Internal Day #tdtech

  1. Planet-scale Data Ingestion Pipeline: Bigdam. PLAZMA TD Internal Day 2018/02/19 #tdtech. Satoshi Tagomori (@tagomoris)
  2. Satoshi Tagomori (@tagomoris): Fluentd, MessagePack-Ruby, Norikra, Woothee, ... Treasure Data, Inc., Backend Team
  3. • Design for Large Scale Data Ingestion • Issues to Be Solved • Re-designing Systems • Re-designed Pipeline: Bigdam • Consistency • Scaling
  4. Large Scale Data Ingestion: Traditional Pipeline
  5. Data Ingestion in Treasure Data • Accept requests from clients • td-agent • TD SDKs (incl. HTTP requests w/ JSON) • Format data into MPC1 • Store MPC1 files into Plazmadb (diagram: clients → data ingestion pipeline: json / msgpack.gz → MPC1 → Plazmadb → Presto / Hive)
  6. Traditional Pipeline • Streaming Import API for td-agent • API Server (RoR), Temporary Storage (S3) • Import task queue (perfectqueue), workers (Java) • 1 msgpack.gz file in request → 1 MPC1 file on Plazmadb (diagram: td-agent → api-import (RoR) → S3 → PerfectQueue → Import Worker → Plazmadb: msgpack.gz → MPC1)
  7. Traditional Pipeline: Event Collector • APIs for TD SDKs • Event Collector nodes (hosted Fluentd) • on top of the Streaming Import API • 1 MPC1 file on Plazmadb per 3 min. per Fluentd process (diagram: TD SDKs → event-collector (Fluentd) → api-import (RoR) → S3 → PerfectQueue → Import Worker → Plazmadb)
  8. Growing Traffic on the Traditional Pipeline • Throughput of perfectqueue • Latency until queries via Event-Collector • Maintaining Event-Collector code • Many small temporary files on S3 • Many small imported files on Plazmadb on S3 (diagram: traditional pipeline, as above)
  9. Perfectqueue Throughput Issue • Perfectqueue • "PerfectQueue is a highly available distributed queue built on top of RDBMS." • Fair scheduling • https://github.com/treasure-data/perfectqueue • Perfectqueue is NOT "perfect"... • Needs a wide lock on the table: poor concurrency (diagram: traditional pipeline, as above)
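To illustrate the "wide lock" problem, here is a minimal sketch of a generic RDBMS-backed dequeue with fair scheduling. It is not PerfectQueue's actual SQL or schema (the tasks table and its columns are hypothetical); it only shows why such a design tends to serialize workers: picking the "fairest" runnable task needs a global ordering plus a lock while claiming the row, so concurrent workers contend on the same rows and throughput does not scale with the number of workers.

```java
// Illustrative only: a hypothetical RDBMS-backed dequeue, not PerfectQueue's code.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class RdbmsQueueSketch {
    // Each worker runs this in a loop. The SELECT ... FOR UPDATE locks rows in
    // scan order to choose the next task fairly, so dequeues across many
    // workers end up waiting on the same locks.
    public static Long acquireNextTask(Connection conn, long nowEpochSec) throws SQLException {
        conn.setAutoCommit(false);
        String sql =
            "SELECT id FROM tasks " +
            " WHERE next_run_at <= ? " +
            " ORDER BY owner, next_run_at " + // fair scheduling needs a global ordering...
            " LIMIT 1 FOR UPDATE";            // ...and a lock while the row is claimed
        try (PreparedStatement st = conn.prepareStatement(sql)) {
            st.setLong(1, nowEpochSec);
            try (ResultSet rs = st.executeQuery()) {
                if (!rs.next()) { conn.rollback(); return null; }
                long id = rs.getLong(1);
                try (PreparedStatement upd = conn.prepareStatement(
                        "UPDATE tasks SET next_run_at = ? WHERE id = ?")) {
                    upd.setLong(1, nowEpochSec + 300);  // lease the task for 300s
                    upd.setLong(2, id);
                    upd.executeUpdate();
                }
                conn.commit();
                return id;
            }
        }
    }
}
```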
  10. Latency until Queries via Event-Collector • Event-collector buffers data in its own storage • 3 min. + α • Customers have to wait 3+ min. until a record becomes visible on Plazmadb • Halving the buffering time makes 2x as many MPC1 files (diagram: traditional pipeline, as above)
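The trade-off can be stated as a rough relation (assuming one flush produces one MPC1 file per event-collector process): the file count grows inversely with the buffering interval, so simply flushing more often is not a fix.

```latex
\text{MPC1 files per process per day} \approx \frac{86400\,\mathrm{s}}{\text{flush interval (s)}}
\qquad \text{e.g. } \frac{86400}{180} = 480 \text{ at 3 min},\quad \frac{86400}{90} = 960 \text{ at 1.5 min}
```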
  11. Maintaining Event-Collector Code • Mitsu says: "No problem about maintaining event-collector code" • :P • Event-collector processes HTTP requests in Ruby code • Hard to test it (diagram: traditional pipeline, as above)
  12. Many Small Temporary Files on S3 • api-import uploads all requested msgpack.gz files to S3 • An S3 outage is a critical issue • AWS S3 outage in us-east-1 on Feb 28th, 2017 • Many uploaded files make costs high • cost per object • cost per operation (diagram: traditional pipeline, as above)
  13. Many Small Imported Files on Plazmadb on S3 • 1 MPC1 file on Plazmadb from 1 msgpack.gz file • on Plazmadb realtime storage • https://www.slideshare.net/treasure-data/td-techplazma • Many MPC1 files: • S3 request cost to store • S3 request cost to fetch (from Presto, Hive) • Performance regression when fetching many small files in queries (256MB expected vs. 32MB actual)
  14. Re-designing Systems
  15. Make "Latency" Shorter (1) • Clients to our endpoints • JS SDKs on customers' pages send data to our endpoints from mobile devices • Longer latency increases the % of dropped records • Many endpoints around the Earth: US, Asia + others • Plazmadb in us-east-1 as the "central location" • Many geographically distributed "edge locations"
  16. Make "Latency" Shorter (2) • Shorter waiting time to query records • Flexible import task scheduling (better if configurable) • Decouple buffers from endpoint server processes • More frequent imports with aggregated buffers (diagram: BEFORE: a buffer per endpoint process → MPC1; AFTER: buffers decoupled from endpoints and aggregated → MPC1)
  17. Redesigning Queues • Fair scheduling is not required for import tasks • Import tasks are FIFO (First In, First Out) • Small payload: (apikey, account_id, database, table) • More throughput • Using a queue service + RDBMS • Queue service for enqueuing/dequeuing • RDBMS to provide at-least-once
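A minimal sketch of what such a small task payload could look like (field names are taken from the slide, but the record itself and its taskId helper are assumptions, not the actual bigdam-queue schema). The queue service only carries this tuple; the data itself stays in the buffer storage, and the RDBMS row that tracks the task's state is what provides at-least-once.

```java
// Hypothetical import-task payload: just enough to identify which buffers to import.
public record ImportTask(String apikey, long accountId, String database, String table) {
    // A stable id lets the RDBMS state row and the queue message refer to the same
    // logical task, so a redelivered message can be matched to its state.
    public String taskId() {
        return accountId + "/" + database + "/" + table + "/" + apikey.hashCode();
    }
}
```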
  18. S3-free Temporary Storage • Make the pipeline free from S3 outages • Distributed storage cluster as the buffer for uploaded data (w/ replication) • Buffers are transferred between edge and central locations (diagram: clients → endpoints → storage cluster at the edge location → storage cluster at the central location → MPC1)
  19. Merging Temporary Buffers into a File on Plazmadb • Non-1-by-1 conversion from msgpack.gz to MPC1 • Buffers can be gathered using a secondary index • primary index: buffer_id • secondary index: account_id, database, table, apikey (diagram: BEFORE: one MPC1 per buffer; AFTER: many buffers merged into one MPC1)
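A rough sketch of the two lookup paths described above, with hypothetical names rather than bigdam-pool's real API: buffers are addressed individually by buffer_id, and gathered for import by the (account_id, database, table, apikey) secondary key, so that many buffers become one MPC1 file.

```java
import java.util.List;

// Hypothetical interface illustrating the primary and secondary index paths.
public interface BufferIndexSketch {
    record BufferKey(long accountId, String database, String table, String apikey) {}

    // Primary index: one specific buffer.
    Buffer findByBufferId(String bufferId);

    // Secondary index: every buffer destined for the same Plazmadb table, regardless
    // of which endpoint or pool node produced it. The import worker merges this whole
    // list into a single MPC1 file instead of writing one file per buffer.
    List<Buffer> findByDestination(BufferKey key);

    interface Buffer {
        String bufferId();
        byte[] content();  // concatenated msgpack.gz / json chunks
    }
}
```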
  20. Should It Provide Read-After-Write Consistency? • BigQuery provides Read-After-Write consistency • Pros: an inserted record can be queried immediately • Cons: • Much longer latency (especially from non-US regions) • Much more expensive to host API servers for longer HTTP sessions • Much more expensive to host query nodes for smaller files on Plazmadb • Much more trouble • Say "No!" to it (Appendix)
  21. Bigdam
  22. Bigdam: Planet-scale! Edge locations on the earth + the central location
  23. Bigdam-Gateway (mruby on h2o) • HTTP endpoint servers • Rack-like API for mruby handlers • Easy to write, easy to test (!) • Async HTTP requests from mruby, managed by h2o using Fiber • HTTP/2 capability in the future • Handles all requests from td-agent and TD SDKs • decode/authorize requests • send data to storage nodes in parallel (to replicate)
  24. Bigdam-Pool (Java) • Distributed storage for buffering • Expected data size: 1KB (a json) ~ 32MB (a msgpack.gz from td-agent) • Append data into a buffer • Query buffers using the secondary index • Transfer buffers from the edge to the central location (diagram: chunks appended to buffers at the edge location; buffers committed by size or timeout; transferred to the central location over the Internet using HTTPS or HTTP/2; queried by import workers by account_id, database, table)
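A condensed sketch of the buffer lifecycle the slide describes; the class, field names, and thresholds are assumptions, not bigdam-pool's real code. Chunks are appended to an open buffer at the edge, and the buffer is committed once it reaches a size limit or a timeout, after which it would be transferred to the central location.

```java
import java.time.Duration;
import java.time.Instant;
import java.io.ByteArrayOutputStream;

// Hypothetical edge-side buffer: append-only until committed by size or timeout.
public class EdgeBufferSketch {
    private final ByteArrayOutputStream content = new ByteArrayOutputStream();
    private final Instant createdAt = Instant.now();
    private final long sizeLimitBytes;   // e.g. 32 MB
    private final Duration timeout;      // e.g. a configurable flush interval
    private boolean committed = false;

    public EdgeBufferSketch(long sizeLimitBytes, Duration timeout) {
        this.sizeLimitBytes = sizeLimitBytes;
        this.timeout = timeout;
    }

    // Chunks range from ~1KB (a single json) to ~32MB (a msgpack.gz from td-agent).
    public synchronized void append(byte[] chunk) {
        if (committed) throw new IllegalStateException("buffer already committed");
        content.writeBytes(chunk);
        maybeCommit();
    }

    public synchronized void maybeCommit() {
        boolean overSize = content.size() >= sizeLimitBytes;
        boolean expired = Instant.now().isAfter(createdAt.plus(timeout));
        if (!committed && (overSize || expired)) {
            committed = true;
            // Here the real system would transfer the buffer from the edge cluster to
            // the central cluster (HTTPS or HTTP/2) and notify the scheduler that a
            // committed buffer exists for its (account_id, database, table, apikey).
        }
    }
}
```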
  25. Bigdam-Scheduler (Golang) • Scheduler server • Bigdam-pool nodes request bigdam-scheduler to schedule import tasks (many times per second) • Bigdam-scheduler enqueues import tasks into bigdam-queue (once per configured interval: default 1 min.) (diagram: bigdam-pool nodes → bigdam-scheduler, for every committed buffer; bigdam-scheduler → bigdam-queue, once a minute per account/db/table)
  26. Scheduler entry lifecycle (account_id, database, table, apikey): 1. bigdam-pool requests scheduling of an import task for every buffer 2. the requested task is added to the scheduler entries, if missing 3. the task is scheduled to be enqueued after a timeout from entry creation 4. the import task is enqueued into bigdam-queue 5. the entry is removed from the scheduler if the enqueue succeeded (diagram: bigdam-pool nodes → scheduler entries → tasks to be enqueued → bigdam-queue)
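A compact sketch of that entry lifecycle. The real bigdam-scheduler is written in Go, so this Java version with hypothetical names only illustrates the logic: a scheduling request creates an entry if one does not exist, and a periodic flush enqueues each due entry into bigdam-queue, removing the entry only after a successful enqueue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduler: collapses many per-buffer notifications into one
// import task per (account, database, table, apikey) per interval.
public class SchedulerSketch {
    record Entry(String apikey, long accountId, String database, String table) {}

    interface QueueClient { boolean enqueue(Entry task); } // stand-in for bigdam-queue

    private final Map<Entry, Long> entries = new ConcurrentHashMap<>(); // entry -> created-at millis
    private final long intervalMillis;
    private final QueueClient queue;

    public SchedulerSketch(QueueClient queue, long intervalMillis) {
        this.queue = queue;
        this.intervalMillis = intervalMillis;
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::flushDueEntries, 1, 1, TimeUnit.SECONDS);
    }

    // Steps 1-2: bigdam-pool calls this for every committed buffer; the entry is
    // added only if missing, so repeated notifications are cheap.
    public void schedule(Entry entry) {
        entries.putIfAbsent(entry, System.currentTimeMillis());
    }

    // Steps 3-5: enqueue entries whose timeout has passed, and drop an entry only
    // when its enqueue succeeded (otherwise it is retried on the next tick).
    private void flushDueEntries() {
        long now = System.currentTimeMillis();
        entries.forEach((entry, createdAt) -> {
            if (now - createdAt >= intervalMillis && queue.enqueue(entry)) {
                entries.remove(entry, createdAt);
            }
        });
    }
}
```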
  27. Bigdam-Queue (Java) • High-throughput queue for import tasks • Enqueue/dequeue using AWS SQS (standard queue) • Task state management using AWS Aurora • Roughly ordered, at-least-once (diagram: enqueue: bigdam-scheduler → bigdam-queue server: 1. INSERT the task into Aurora as "enqueued", 2. enqueue to SQS; dequeue: bigdam-import → bigdam-queue server: 1. dequeue from SQS, 2. UPDATE the task to "running" in Aurora; finish: DELETE the task from Aurora)
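A rough sketch of the state handling in that diagram; the interfaces are stand-ins, not the actual bigdam-queue code or the AWS SDK. The Aurora row is the source of truth for at-least-once, while SQS only drives delivery and rough ordering.

```java
// Hypothetical state handling: Aurora row = source of truth, SQS = delivery.
public class BigdamQueueSketch {
    enum State { ENQUEUED, RUNNING }

    interface TaskStore {                      // stand-in for the Aurora tables
        void insert(String taskId, State s);                        // INSERT task row
        boolean updateState(String taskId, State from, State to);   // UPDATE ... WHERE state = from
        void delete(String taskId);                                  // DELETE on finish
    }
    interface MessageQueue {                   // stand-in for AWS SQS (standard queue)
        void send(String taskId);
        String receive();                      // may redeliver: SQS standard is at-least-once
    }

    private final TaskStore store;
    private final MessageQueue sqs;
    BigdamQueueSketch(TaskStore store, MessageQueue sqs) { this.store = store; this.sqs = sqs; }

    // Enqueue: 1. INSERT into Aurora, 2. send to SQS.
    void enqueue(String taskId) {
        store.insert(taskId, State.ENQUEUED);
        sqs.send(taskId);
    }

    // Dequeue: 1. receive from SQS, 2. flip ENQUEUED -> RUNNING in Aurora.
    // If the compare-and-set fails, the message was a duplicate delivery and is ignored.
    String dequeue() {
        String taskId = sqs.receive();
        if (taskId != null && store.updateState(taskId, State.ENQUEUED, State.RUNNING)) {
            return taskId;
        }
        return null;
    }

    // Finish: DELETE the row. A task that never finishes remains visible in Aurora
    // and can be re-dispatched, which is what makes the whole path at-least-once.
    void finish(String taskId) {
        store.delete(taskId);
    }
}
```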
  28. Bigdam-Import (Java) • Import worker • Convert source data (json/msgpack.gz) to MPC1 • Execute import tasks in parallel • Dequeue tasks from bigdam-queue • Query and download buffers from bigdam-pool • Make a list of chunk ids and put it into bigdam-dddb • Execute deduplication to determine the chunks to be imported • Make MPC1 files and put them into Plazmadb
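Put together as a loop, the worker's job could look roughly like this; all the interfaces and helper names are stand-ins for the components above, not the real bigdam-import code.

```java
import java.util.List;

// Hypothetical import worker loop tying the components together.
public class ImportWorkerSketch {
    interface Queue    { String dequeue(); void finish(String taskId); }        // bigdam-queue
    interface Pool     { List<byte[]> downloadBuffers(String taskId); }         // bigdam-pool (secondary-index query)
    interface Dddb     { List<String> dedup(String taskId, List<String> ids); } // bigdam-dddb
    interface Plazmadb { void putMpc1(byte[] mpc1); }

    private final Queue queue; private final Pool pool;
    private final Dddb dddb;   private final Plazmadb plazmadb;

    ImportWorkerSketch(Queue q, Pool p, Dddb d, Plazmadb db) {
        this.queue = q; this.pool = p; this.dddb = d; this.plazmadb = db;
    }

    void runOnce() {
        String taskId = queue.dequeue();
        if (taskId == null) return;
        List<byte[]> buffers = pool.downloadBuffers(taskId);      // query + download buffers
        List<String> chunkIds = buffers.stream().map(ImportWorkerSketch::chunkIdOf).toList();
        List<String> toImport = dddb.dedup(taskId, chunkIds);     // drop chunks imported by past tasks
        byte[] mpc1 = convertToMpc1(buffers, toImport);           // json/msgpack.gz -> MPC1
        plazmadb.putMpc1(mpc1);
        queue.finish(taskId);                                     // only now is the task finished
    }

    private static String chunkIdOf(byte[] chunk) { return Integer.toHexString(java.util.Arrays.hashCode(chunk)); }
    private static byte[] convertToMpc1(List<byte[]> buffers, List<String> ids) { return new byte[0]; /* placeholder */ }
}
```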
  29. Bigdam-Dddb (Java) • Database service for deduplication • Based on AWS Aurora and S3 • Stores unique chunk ids per import task, so the same chunk is not imported twice • For a small list of chunk ids: 1. store the chunk-id list on the bigdam-dddb server, 2. INSERT (task-id, list-of-chunk-ids) into Aurora • For a huge list of chunk ids: 1. upload the encoded chunk ids to S3, 2. store the task-id and the S3 object path, 3. INSERT (task-id, path-of-ids) into Aurora • To fetch chunk-id lists imported in the past: 1. query the lists of past tasks, 2. SELECT from Aurora, 3. download from S3 if needed
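A sketch of the dedup decision; the names are stand-ins and the small-vs-huge cutoff is an assumption. The chunk-id list of the current task is stored (inline in Aurora when small, on S3 with only the path in Aurora when huge), and chunks already present in past tasks' lists for the same destination are skipped.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical deduplication check on top of stored chunk-id lists.
public class DddbSketch {
    interface Store {
        void saveInline(String taskId, List<String> chunkIds);          // small list: row in Aurora
        void saveOnS3(String taskId, List<String> chunkIds);            // huge list: object on S3, path in Aurora
        List<List<String>> chunkIdListsOfPastTasks(String destination); // SELECT (+ S3 download if needed)
    }

    private static final int INLINE_LIMIT = 10_000;  // assumed cutoff, not the real one
    private final Store store;
    DddbSketch(Store store) { this.store = store; }

    // Returns the chunk ids that still need to be imported for this task.
    List<String> registerAndDedup(String taskId, String destination, List<String> chunkIds) {
        if (chunkIds.size() <= INLINE_LIMIT) store.saveInline(taskId, chunkIds);
        else store.saveOnS3(taskId, chunkIds);

        Set<String> alreadyImported = new HashSet<>();
        for (List<String> past : store.chunkIdListsOfPastTasks(destination)) {
            alreadyImported.addAll(past);
        }
        return chunkIds.stream().filter(id -> !alreadyImported.contains(id)).toList();
    }
}
```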
  30. Consistency and Scaling
  31. Executing Deduplication at the End of the Pipeline • Make it simple & reliable • At-least-once everywhere; deduplication at the end (transaction + retries) (diagram: clients (data input) → gateway → pool (edge) → pool (central) → import worker (with queue, scheduler, dddb) → Plazmadb)
  32. At-Least-Once: Bigdam-pool Data Replication • Client-side replication (for large chunks, 1MB~): the client uploads 3 replicas to 3 nodes in parallel • Server-side replication (for small chunks, ~1MB): the primary node appends chunks to an existing buffer and replicates them (for equal contents/checksums across nodes)
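A sketch of the two replication paths; the interfaces are stand-ins for bigdam-pool's write API, and the 1MB cutoff is taken from the slide. Large chunks are written by the client to three nodes in parallel, while small chunks go to the primary node, which appends and then replicates the buffer so all replicas keep equal contents and checksums.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical replication paths for bigdam-pool writes.
public class ReplicationSketch {
    interface PoolNode {
        CompletableFuture<Void> put(String bufferId, byte[] chunk);                // whole-chunk write
        CompletableFuture<Void> appendAndReplicate(String bufferId, byte[] chunk); // primary-side append
    }

    private static final int LARGE_CHUNK_BYTES = 1 << 20; // 1MB cutoff, per the slide

    // Client-side replication: send the same chunk to 3 nodes in parallel and wait
    // for all of them, so a large chunk is never funneled through a single node.
    static void writeLarge(List<PoolNode> threeReplicas, String bufferId, byte[] chunk) {
        CompletableFuture.allOf(
            threeReplicas.stream().map(n -> n.put(bufferId, chunk))
                         .toArray(CompletableFuture[]::new)
        ).join();
    }

    // Server-side replication: small chunks are appended by the primary node only,
    // which then replicates the buffer, keeping contents/checksums equal across nodes.
    static void writeSmall(PoolNode primary, String bufferId, byte[] chunk) {
        primary.appendAndReplicate(bufferId, chunk).join();
    }

    static void write(List<PoolNode> replicas, String bufferId, byte[] chunk) {
        if (chunk.length >= LARGE_CHUNK_BYTES) writeLarge(replicas, bufferId, chunk);
        else writeSmall(replicas.get(0), bufferId, chunk);
    }
}
```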
  33. At-Least-Once: Bigdam-pool Data Replication • Server-side replication for transferred buffers
  34. Scaling Out (Almost) Everywhere • Scalable components on EC2 (& ready for AWS autoscaling) • AWS Aurora (w/o table locks) + AWS SQS (+ AWS S3) (diagram: clients (data input) → gateway → pool (edge) → pool (central) → import worker → Plazmadb, with scale-out annotations; queue, scheduler, dddb alongside)
  35. Scaling Up in Just One Case: the Scheduler • The scheduler needs to collect notifications for all buffers • and cannot be parallelized across nodes (in an easy way) • Solution: a high-performance singleton server: 90k+ reqs/sec (diagram: the pipeline with the scheduler as a singleton server)
  36. Bigdam Current Status: Under Testing
  37. It's great fun to design Distributed Systems! Thank you! @tagomoris
  38. We're Hiring!
