Planet-scale Data Ingestion Pipeline
Bigdam
PLAZMA TD Internal Day 2018/02/19
#tdtech
Satoshi Tagomori (@tagomoris)
Satoshi Tagomori (@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, Woothee, ...
Treasure Data, Inc.
Backend Team
• Design for Large Scale Data Ingestion
• Issues to Be Solved
• Re-designing Systems
• Re-designed Pipeline: Bigdam
• Consistency
• Scaling
Large Scale Data Ingestion:
Traditional Pipeline
Data Ingestion in Treasure Data
• Accept requests from clients
• td-agent
• TD SDKs (incl. HTTP requests w/ JSON)
• Format data into MPC1
• Store MPC1 files into Plazmadb
[Diagram: clients send json / msgpack.gz → Data Ingestion Pipeline → MPC1 files on Plazmadb → queried by Presto and Hive]
Traditional Pipeline
• Streaming Import API for td-agent
• API Server (RoR), Temporary Storage (S3)
• Import task queue (perfectqueue), workers (Java)
• 1 msgpack.gz file in request → 1 MPC1 file on Plazmadb
[Diagram: td-agent → msgpack.gz → api-import (RoR) → msgpack.gz on S3 → PerfectQueue → Import Worker → MPC1 on Plazmadb]
Traditional Pipeline: Event Collector
• APIs for TD SDKs
• Event Collector nodes (hosted Fluentd)
• on the top of Streaming Import API
• 1 MPC1 file on Plazmadb per 3min. per Fluentd process
[Diagram: TD SDKs → json → event-collector (Fluentd) → msgpack.gz → api-import (RoR) → S3 → PerfectQueue → Import Worker → MPC1 on Plazmadb]
Growing Traffic on the Traditional Pipeline
• Throughput of perfectqueue
• Latency until queries via Event-Collector
• Maintaining Event-Collector code
• Many small temporary files on S3
• Many small imported files on Plazmadb on S3
[Diagram: TD SDKs (json) and td-agent (msgpack.gz) → event-collector / api-import (RoR) → msgpack.gz on S3 → PerfectQueue → Import Worker → MPC1 on Plazmadb]
Perfectqueue Throughput Issue
• Perfectqueue
• "PerfectQueue is a highly available distributed queue built on top of
RDBMS."
• Fair scheduling
• https://github.com/treasure-data/perfectqueue
• Perfectqueue is NOT "perfect"...
• Needs a wide lock on the table: poor concurrency
Latency until Queries via Event-Collector
• Event-collector buffers data in its storage
• 3min. + α
• Customers have to wait 3+ minutes until a record becomes visible on Plazmadb
• Halving the buffering time makes 2x as many MPC1 files
Maintaining Event-Collector Code
• Mitsu says: "No problem about maintaining event-collector code"
• :P
• Event-collector processes HTTP requests in Ruby code
• Hard to test it
Many Small Temporary Files on S3
• api-import uploads all requested msgpack.gz files to S3
• An S3 outage is a critical issue
• AWS S3 outage in us-east-1 on Feb 28th, 2017
• Many uploaded files make costs expensive
• costs per object
• costs per operation
Many Small Imported Files on Plazmadb on S3
• 1 MPC1 file on Plazmadb from 1 msgpack.gz file
• on Plazmadb realtime storage
• https://www.slideshare.net/treasure-data/td-techplazma
• Many MPC1 files:
• S3 request cost to store
• S3 request cost to fetch (from Presto, Hive)
• Performance regression to fetch many small files in queries (256MB expected vs. 32MB actual)
Re-designing Systems
Make "Latency" Shorter (1)
• Clients to our endpoints
• JS SDK on customers' pages sends data to our endpoints from mobile devices
• Longer latency increases % of dropped records
• Many endpoints on the Earth: US, Asia + others
• Plazmadb in us-east-1 as "central location"
• Many geographically-distributed "edge locations"
Make "Latency" Shorter (2)
• Shorter waiting time to query records
• Flexible import task scheduling - better if configurable
• Decouple buffers from endpoint server processes
• More frequent import with aggregated buffers
[Diagram: BEFORE — each endpoint process holds its own buffer; AFTER — buffers are decoupled from endpoint processes and shared, then imported as MPC1 files]
Redesigning Queues
• Fair scheduling is not required for import tasks
• Import tasks are FIFO (First In, First Out)
• Small payload - (apikey, account_id, database, table)
• More throughput
• Using Queue service + RDBMS
• Queue service for enqueuing/dequeuing
• RDBMS to provide at-least-once
S3-free Temporary Storage
• Make the pipeline free from S3 outage
• Distributed storage cluster as buffer for uploaded data (w/ replication)
• Buffer transferring between edge and central locations
[Diagram: clients → endpoints at the edge location → buffers in an edge storage cluster → transferred to the central storage cluster → MPC1 files]
Merging Temporary Buffers into a File on Plazmadb
• Non-1-by-1 conversion from msgpack.gz to MPC1
• Buffers can be gathered using secondary index
• primary index: buffer_id
• secondary index: account_id, database, table, apikey
Should It Provide Read-After-Write Consistency?
• BigQuery provides Read-After-Write consistency
• Pros: an inserted record can be queried immediately
• Cons:
• Much longer latency (especially from non-US regions)
• Much more expensive to host API servers for longer HTTP sessions
• Much more expensive to host Query nodes for smaller files on Plazmadb
• Much more trouble
• Say "No!" to it
Appendix
Bigdam
Bigdam: Planet-scale!
Edge locations on the Earth + the Central location
Bigdam-Gateway (mruby on h2o)
• HTTP Endpoint servers
• Rack-like API for mruby handlers
• Easy to write, easy to test (!)
• Async HTTP requests from mruby, managed by h2o using Fiber
• HTTP/2 capability in future
• Handles all requests from td-agent and TD SDKs
• decode/authorize requests
• send data to storage nodes in parallel (to replicate)
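To make the "easy to write, easy to test" point concrete, here is a minimal sketch of what a Rack-like mruby handler on h2o could look like for such an endpoint. The header name and the authorize / replicate_to_pool_nodes helpers are hypothetical placeholders, not Bigdam's actual code:

```ruby
# Minimal sketch (assumption, not Bigdam's actual handler) of a Rack-like
# mruby handler as run by h2o: the file evaluates to a Proc taking a Rack env
# and returning [status, headers, body].

def authorize(apikey)
  # placeholder: the real handler would validate the apikey against TD's auth
  !apikey.nil? && !apikey.empty?
end

def replicate_to_pool_nodes(apikey, body)
  # placeholder: the real handler sends the chunk to several bigdam-pool nodes
  # in parallel (async HTTP requests managed by h2o) before acknowledging
  true
end

Proc.new do |env|
  apikey = env["HTTP_X_TD_APIKEY"] # assumed header name, for illustration only
  if authorize(apikey)
    body = env["rack.input"] ? env["rack.input"].read : ""
    ok = replicate_to_pool_nodes(apikey, body)
    [ok ? 200 : 500, {"content-type" => "application/json"}, ["{\"ok\":#{ok}}"]]
  else
    [403, {"content-type" => "text/plain"}, ["forbidden"]]
  end
end
```

Because the handler is just a Proc over a plain env hash, it can be exercised in unit tests without running h2o at all.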
Bigdam-Pool (Java)
• Distributed Storage for buffering
• Expected data size: 1KB (a json) ~ 32MB (a msgpack.gz from td-agent)
• Append data into a buffer
• Query buffers using secondary index
• Transfer buffers from edge to central
[Diagram: chunks are appended into buffers at the edge location; a committed buffer (by size or timeout) is transferred over the Internet using HTTPS or HTTP/2 to the central location, where import workers query buffers by account_id, database, table]
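A toy in-memory model (an assumption for illustration, not bigdam-pool's real implementation) of buffers addressed by a primary index (buffer_id) and a secondary index (account_id, database, table, apikey), with commit by size or timeout:

```ruby
# Toy model of bigdam-pool style buffering: chunks are appended to a buffer
# found via the secondary index; buffers become "committed" by size or timeout.
require "securerandom"

class BufferStore
  Buffer = Struct.new(:buffer_id, :key, :chunks, :bytesize, :created_at)

  def initialize(commit_size: 32 * 1024 * 1024, commit_timeout: 60)
    @buffers = {}                                # primary index: buffer_id => Buffer
    @by_key  = Hash.new { |h, k| h[k] = [] }     # secondary index: key => [buffer_id]
    @commit_size = commit_size
    @commit_timeout = commit_timeout
  end

  # Append a chunk (json or msgpack.gz bytes) into the open buffer for the key.
  def append(account_id, database, table, apikey, chunk)
    key = [account_id, database, table, apikey]
    buffer_id = @by_key[key].last || create_buffer(key)
    buf = @buffers[buffer_id]
    buf.chunks << chunk
    buf.bytesize += chunk.bytesize
    buffer_id
  end

  # Buffers committed (ready to transfer/import) by size or timeout.
  def committed
    @buffers.values.select do |b|
      b.bytesize >= @commit_size || (Time.now - b.created_at) >= @commit_timeout
    end
  end

  # Query buffers via the secondary index, as import workers do.
  def lookup(account_id, database, table, apikey)
    @by_key[[account_id, database, table, apikey]].map { |id| @buffers[id] }
  end

  private

  def create_buffer(key)
    id = SecureRandom.uuid
    @buffers[id] = Buffer.new(id, key, [], 0, Time.now)
    @by_key[key] << id
    id
  end
end
```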
Bigdam-Scheduler (Golang)
• Scheduler server
• Bigdam-pool requests bigdam-scheduler to schedule import tasks (many times every second)
• Bigdam-scheduler enqueues import tasks into bigdam-queue (once per configured interval: default 1 min.)
[Diagram: bigdam-pool nodes → bigdam-scheduler → bigdam-queue; for every committed buffer, once in a minute per account/db/table]
Scheduling flow (entries keyed by account_id, database, table, apikey):
1. bigdam-pool requests to schedule an import task for every committed buffer
2. The requested task is added to the scheduler entries, if missing
3. The task is scheduled to be enqueued after a timeout from entry creation
4. The import task is enqueued into bigdam-queue
5. The entry is removed from the scheduler if enqueuing succeeded
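The five steps above can be sketched roughly as follows (in Ruby for consistency with the other examples here, although bigdam-scheduler itself is written in Golang); the class and method names are illustrative only:

```ruby
# Sketch of the scheduling behavior: aggregate many per-buffer requests into
# one entry per (account_id, database, table, apikey), then enqueue a single
# import task per entry once the configured interval has passed.
class Scheduler
  Entry = Struct.new(:key, :created_at)

  def initialize(queue, interval: 60)
    @queue = queue        # anything responding to #enqueue(key), e.g. a bigdam-queue client
    @interval = interval  # default 1 min.
    @entries = {}         # key => Entry
  end

  # Steps 1-2: called by pool nodes for every committed buffer; adds the entry if missing.
  def request(account_id, database, table, apikey)
    key = [account_id, database, table, apikey]
    @entries[key] ||= Entry.new(key, Time.now)
  end

  # Steps 3-5: called periodically; enqueues tasks whose entries have timed out,
  # and removes an entry only if enqueuing succeeded (so failures are retried).
  def tick(now = Time.now)
    @entries.keys.each do |key|
      entry = @entries[key]
      next if now - entry.created_at < @interval
      @entries.delete(key) if @queue.enqueue(key)
    end
  end
end
```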
Bigdam-Queue (Java)
• High throughput queue for import tasks
• Enqueue/dequeue using AWS SQS (standard queue)
• Task state management using AWS Aurora
• Roughly ordered, At-least-once
Enqueue (from bigdam-scheduler): 1. INSERT the task INTO AWS Aurora as "enqueued", 2. enqueue it to AWS SQS (standard)
Dequeue (requested by bigdam-import): 1. dequeue the task from SQS, 2. UPDATE the Aurora row to "running"
Finish: 1. DELETE the task row from Aurora
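A hedged sketch of the pattern shown above: task state in Aurora (MySQL-compatible, via the mysql2 gem) plus AWS SQS for enqueue/dequeue. The table name, columns, and the exact moment the SQS message is deleted are assumptions, not bigdam-queue's actual code:

```ruby
# Queue service (SQS) for enqueue/dequeue, RDBMS (Aurora) for at-least-once
# task state; roughly ordered because SQS standard queues are roughly ordered.
require "aws-sdk-sqs"
require "mysql2"
require "json"

class ImportTaskQueue
  def initialize(queue_url:, db:)
    @sqs = Aws::SQS::Client.new
    @queue_url = queue_url
    @db = db # Mysql2::Client connected to Aurora
  end

  # Enqueue: 1. INSERT the task as 'enqueued' into Aurora, 2. send it to SQS.
  def enqueue(task_id, payload)
    @db.prepare("INSERT INTO tasks (task_id, payload, state) VALUES (?, ?, 'enqueued')")
       .execute(task_id, payload.to_json)
    @sqs.send_message(queue_url: @queue_url, message_body: task_id)
  end

  # Dequeue: 1. receive from SQS, 2. mark the task 'running' in Aurora.
  # The Aurora row is what guarantees at-least-once: stale 'running' rows can
  # be re-enqueued later, and downstream deduplication absorbs redelivery.
  def dequeue
    msg = @sqs.receive_message(queue_url: @queue_url, max_number_of_messages: 1).messages.first
    return nil unless msg
    task_id = msg.body
    @db.prepare("UPDATE tasks SET state = 'running' WHERE task_id = ?").execute(task_id)
    @sqs.delete_message(queue_url: @queue_url, receipt_handle: msg.receipt_handle)
    task_id
  end

  # Finish: DELETE the task row from Aurora.
  def finish(task_id)
    @db.prepare("DELETE FROM tasks WHERE task_id = ?").execute(task_id)
  end
end
```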
Bigdam-Import (Java)
• Import worker
• Convert source (json/msgpack.gz) to MPC1
• Execute import tasks in parallel
• Dequeue tasks from bigdam-queue
• Query and download buffers from bigdam-pool
• Make a list of chunk ids and put it into bigdam-dddb
• Execute deduplication to determine chunks to be imported
• Make MPC1 files and put them into Plazmadb
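A condensed sketch of the steps listed above, with each collaborator injected; all method names on the queue/pool/dddb/plazmadb objects are hypothetical stand-ins for the real components:

```ruby
# One iteration of an import worker, following the slide's step list.
class ImportWorker
  def initialize(queue:, pool:, dddb:, plazmadb:)
    @queue, @pool, @dddb, @plazmadb = queue, pool, dddb, plazmadb
  end

  def run_once
    task = @queue.dequeue # task carries apikey, account_id, database, table
    return unless task

    # 1. query & download committed buffers for this account/db/table from bigdam-pool
    buffers = @pool.lookup(task.account_id, task.database, task.table, task.apikey)

    # 2. record this task's chunk ids in bigdam-dddb
    chunk_ids = buffers.flat_map { |b| b.chunks.map(&:chunk_id) }
    @dddb.store(task.id, chunk_ids)

    # 3. deduplicate: drop chunks already imported by past tasks
    imported = @dddb.imported_chunk_ids(task.id)
    chunks = buffers.flat_map(&:chunks).reject { |c| imported.include?(c.chunk_id) }

    # 4. convert json/msgpack.gz chunks into MPC1 and put the files into Plazmadb
    @plazmadb.put(convert_to_mpc1(chunks))

    @queue.finish(task.id)
  end

  def convert_to_mpc1(chunks)
    chunks # placeholder: the actual conversion produces columnar MPC1 files
  end
end
```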
Bigdam-Dddb (Java)
• Database service for deduplication
• Based on AWS Aurora and S3
• Stores unique chunk ids per import task, so the same chunk is not imported twice
For a small list of chunk ids: 1. bigdam-import stores the chunk-id list, 2. the bigdam-dddb server INSERTs (task-id, list-of-chunk-ids) into AWS Aurora
For a huge list of chunk ids: 1. upload the encoded chunk-ids to AWS S3, 2. store the task-id and S3 object path, 3. INSERT (task-id, path-of-ids) into Aurora
To fetch chunk-id lists imported in the past: 1. query the lists of past tasks, 2. SELECT from Aurora, 3. download from S3 if needed
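The small-list / huge-list split described above could look roughly like this (illustrative table, column, and bucket names; not bigdam-dddb's actual schema):

```ruby
# Deduplication DB sketch: small chunk-id lists go straight into Aurora,
# huge lists are uploaded to S3 with only the object path stored in Aurora.
require "aws-sdk-s3"
require "mysql2"
require "json"

class DedupDB
  SIZE_THRESHOLD = 64 * 1024 # arbitrary cut-off for "huge" lists in this sketch

  def initialize(db:, bucket:)
    @db = db # Mysql2::Client connected to Aurora
    @s3 = Aws::S3::Client.new
    @bucket = bucket
  end

  def store(task_id, chunk_ids)
    encoded = chunk_ids.to_json
    if encoded.bytesize <= SIZE_THRESHOLD
      @db.prepare("INSERT INTO dedup_tasks (task_id, chunk_ids) VALUES (?, ?)")
         .execute(task_id, encoded)
    else
      key = "chunk-ids/#{task_id}.json"
      @s3.put_object(bucket: @bucket, key: key, body: encoded)
      @db.prepare("INSERT INTO dedup_tasks (task_id, ids_path) VALUES (?, ?)")
         .execute(task_id, key)
    end
  end

  # Fetch chunk-id lists imported by past tasks (excluding the current one),
  # downloading from S3 when only a path is stored.
  def imported_chunk_ids(current_task_id)
    rows = @db.prepare("SELECT chunk_ids, ids_path FROM dedup_tasks WHERE task_id <> ?")
              .execute(current_task_id)
    rows.flat_map do |row|
      encoded = row["chunk_ids"] ||
                @s3.get_object(bucket: @bucket, key: row["ids_path"]).body.read
      JSON.parse(encoded)
    end
  end
end
```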
Consistency and Scaling
Executing Deduplication at the end of pipeline
• Make it simple & reliable
[Diagram: clients (data input) → gateway → pool (edge) → pool (central) → import worker → Plazmadb, with queue, scheduler and dddb beside the import worker; at-least-once everywhere, deduplication (transaction + retries) at the final import step]
At-Least-Once: Bigdam-pool Data Replication
• Client-side replication: the client uploads 3 replicas to 3 nodes in parallel
• Server-side replication: the primary node appends chunks to the existing buffer and replicates them (for equal contents/checksums across nodes)
• One mode is used for large chunks (1MB~), the other for small chunks (~1MB)
At-Least-Once: Bigdam-pool Data Replication
• Server-side replication for buffers transferred from the edge location to the central location
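As one half of this picture, a minimal sketch of client-side replication: the caller uploads the same chunk to three pool nodes in parallel and requires enough successful replicas before acknowledging. Node addresses, the endpoint path, and the success policy are all assumptions for illustration:

```ruby
# Upload the same chunk to several pool nodes in parallel and count successes.
require "net/http"
require "uri"

POOL_NODES = ["http://pool-1:8080", "http://pool-2:8080", "http://pool-3:8080"]

def replicate_chunk(chunk_id, body, required_replicas: 3)
  results = POOL_NODES.map do |node|
    Thread.new do
      begin
        uri = URI("#{node}/buffers/append?chunk_id=#{chunk_id}") # hypothetical path
        res = Net::HTTP.post(uri, body, "content-type" => "application/octet-stream")
        res.is_a?(Net::HTTPSuccess)
      rescue StandardError
        false
      end
    end
  end.map(&:value)

  results.count(true) >= required_replicas
end
```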
Scaling-out (almost) Everywhere
• Scalable components on EC2 (& ready for AWS autoscaling)
• AWS Aurora (w/o table locks) + AWS SQS (+ AWS S3)
[Diagram: clients (data input) → gateway → pool (edge) → pool (central) → import worker → Plazmadb, with queue, scheduler and dddb; the scale-out components highlighted]
Scaling-up Just For A Case: Scheduler
• The scheduler needs to collect notifications for all buffers
• and cannot be parallelized across nodes (in an easy way)
• Solution: a high-performance singleton server: 90k+ reqs/sec
[Diagram: the same pipeline, with the scheduler highlighted as a singleton server]
Bigdam Current status: Under Testing
It's great fun
to design
Distributed Systems!
Thank you!
@tagomoris
We're Hiring!
