How to create Treasure Data #dotsbigdata
Masahiro Nakagawa
BigData All Stars 2015, August 1, 2015

Published on http://eventdots.jp/event/562221


1. Masahiro Nakagawa, August 1, 2015, BigData All Stars 2015: How to create Treasure Data #dotsbigdata
2. Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> I love OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (only RPC)
> The organizer of Presto Source Code Reading / meetup
> etc…
3. Company overview: http://www.treasuredata.com/opensource
4. Treasure Data Solution: Ingest, Analyze, Distribute
5. Treasure Data Service
> A simplified cloud analytics infrastructure
> Customers focus on their business
> SQL interfaces for schema-less data sources
> Fit for Data Hub / Lake
> Batch / Low latency / Machine Learning
> Lots of ingestion and integrated solutions
> Fluentd / Embulk / Data Connector / SDKs
> Result Output / Prestogres Gateway / BI tools
> Awesome support for time to value
6. (Figure slide; no text.)
7. Plazma - TD's distributed analytical database
8. Plazma by the numbers
> Streaming import: 45 billion records / day
> Bulk import: 10 billion records / day
> Hive queries: 3+ trillion records / day (Machine Learning queries with Hivemall increased)
> Presto queries: 3+ trillion records / day
9. TD's resource management
> Guarantee and boost compute resources
> Guarantee for stabilizing query performance
> Boost for sharing free resources
> Get multi-tenant merit
> Global resource scheduler
> Manages jobs, resources, and priority across users
> Separate storage from compute resources
> Easy to scale workers
> We can use S3 / GCS / Azure Storage as a reliable backend
10. Data Importing
11. Import Queue
td-agent / fluentd -> API Server -> Import Queue (MySQL, PerfectQueue) -> Import Worker
✓ Buffering for 5 minutes
✓ Retrying (at-least once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk
Chunks are MessagePack ("It's like JSON, but fast and small."):
unique_id=375828ce5510cadb
{"time":1426047906,"uid":1,…} {"time":1426047912,"uid":9,…} {"time":1426047939,"uid":3,…} {"time":1426047951,"uid":2,…} …

12. Import Queue
Same flow, with a 1 minute buffer; the queue records each chunk's unique_id and upload time:
unique_id        | time
375828ce5510cadb | 2015-12-01 10:47
2024cffb9510cadc | 2015-12-01 11:09
1b8d6a600510cadd | 2015-12-01 11:21
1f06c0aa510caddb | 2015-12-01 11:38

13. Import Queue
Same flow; unique_id is UNIQUE, so a re-uploaded chunk is inserted at most once. Client retries give at-least-once delivery, the UNIQUE constraint gives at-most-once insertion, so each chunk lands exactly once (see the dedup sketch after slide 14).
14. One Import Queue feeds multiple Import Workers: ✓ HA ✓ Load balancing
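A minimal sketch of the dedup logic on slides 11-13: the client may retry an upload any number of times (at-least once), while a UNIQUE / PRIMARY KEY constraint on unique_id makes the enqueue at-most once, so each chunk is imported effectively once. The table name and columns are illustrative, and sqlite3 stands in for MySQL/PerfectQueue to keep the example self-contained; this is not Treasure Data's actual schema.

    import sqlite3
    import uuid

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE import_queue (
            unique_id TEXT PRIMARY KEY,   -- at-most once: duplicates are rejected
            payload   BLOB NOT NULL,
            uploaded  TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)

    def enqueue_chunk(unique_id: str, payload: bytes) -> bool:
        """Insert a chunk; return False if it was already enqueued."""
        try:
            with db:
                db.execute(
                    "INSERT INTO import_queue (unique_id, payload) VALUES (?, ?)",
                    (unique_id, payload),
                )
            return True
        except sqlite3.IntegrityError:
            return False  # duplicate upload from a client retry: safely ignored

    chunk_id = uuid.uuid4().hex
    print(enqueue_chunk(chunk_id, b"msgpack-bytes"))  # True  (first upload)
    print(enqueue_chunk(chunk_id, b"msgpack-bytes"))  # False (retry is deduplicated)
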
15. Architecture: Import Queue -> Import Workers -> Realtime Storage and Archive Storage (files on Amazon S3 / Basho Riak CS), with metadata on PostgreSQL.
16. Import Workers put chunks into Realtime Storage; the metadata of the records in each file is stored on PostgreSQL:
uploaded time    | file index range                           | records
2015-03-08 10:47 | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3
2015-03-08 11:09 | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25
2015-03-08 11:38 | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14
…                | …                                          | …
17. A Merge Worker (MapReduce) merges Realtime Storage files into Archive Storage every 1 hour; Retrying + Unique (at-least-once + at-most-once) applies here too. Archive Storage metadata on PostgreSQL (merge sketch below):
file index range                           | records
[2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
[2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…                                          | …
18. The PostgreSQL metadata has a GiST (R-tree) index on the "time" range of the files. A query reads a range from Archive Storage if it has already been merged, otherwise from Realtime Storage. (Same Realtime / Archive metadata tables as on the previous slides.)
19. Data Importing
> Scalable & reliable importing
> Fluentd buffers data on disk
> The import queue deduplicates uploaded chunks
> Workers take the chunks and put them into Realtime Storage
> Instant visibility
> Imported data is immediately visible to the query engines
> Background workers merge the files every 1 hour
> Metadata
> The index is built on PostgreSQL using the RANGE type and a GiST index (index sketch below)
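A sketch of that metadata index, assuming PostgreSQL's built-in tsrange type and its GiST support: only files whose time range overlaps the queried range come back from the lookup. The table name, columns, and connection string are placeholders rather than TD's actual metadata schema, and the snippet needs a reachable PostgreSQL plus psycopg2.

    import psycopg2

    conn = psycopg2.connect("dbname=plazma_meta user=plazma")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS archive_files (
                path       text PRIMARY KEY,
                time_range tsrange NOT NULL,   -- [first record time, last record time)
                records    bigint  NOT NULL
            )
        """)
        cur.execute("""
            CREATE INDEX IF NOT EXISTS archive_files_time_idx
                ON archive_files USING gist (time_range)
        """)
        # Find every file that could contain rows between 11:00 and 12:00.
        cur.execute("""
            SELECT path, records FROM archive_files
            WHERE time_range && tsrange('2015-12-01 11:00:00', '2015-12-01 12:00:00')
        """)
        print(cur.fetchall())
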
20. Data processing
21. Archive Storage: files on Amazon S3 / Basho Riak CS in the MessagePack Columnar File Format, metadata on PostgreSQL.
File for [2015-12-01 10:00:00, 2015-12-01 11:00:00]:
time                | code | method
2015-12-01 10:02:36 | 200  | GET
2015-12-01 10:22:09 | 404  | GET
2015-12-01 10:36:45 | 200  | GET
2015-12-01 10:49:21 | 200  | POST
…                   | …    | …
File for [2015-12-01 11:00:00, 2015-12-01 12:00:00]:
time                | code  | method
2015-12-01 11:10:09 | 200   | GET
2015-12-01 11:21:45 | 200   | GET
2015-12-01 11:38:59 | 200   | GET
2015-12-01 11:43:37 | 200   | GET
2015-12-01 11:54:52 | "200" | GET
…                   | …     | …
Metadata on PostgreSQL:
path | index range                                | records
…    | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
…    | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…    | …                                          | …
22. The same data is partitioned two ways: time-based partitioning (one file per time range, tracked by the PostgreSQL metadata) and column-based partitioning (a columnar layout inside each file). A toy columnar-layout sketch follows.
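To make the column-based layout concrete, here is a toy columnar file in Python: a MessagePack header records each column's byte offset and length, and each column is stored as its own MessagePack block, so a reader can fetch just the columns a query needs. This is only the idea, not the real MPC1 format described on slide 26; it requires the msgpack package.

    import msgpack

    rows = [
        {"time": "2015-12-01 10:02:36", "code": 200, "method": "GET"},
        {"time": "2015-12-01 10:22:09", "code": 404, "method": "GET"},
        {"time": "2015-12-01 10:36:45", "code": 200, "method": "GET"},
    ]

    # Writer: pack each column separately and remember its (offset, length).
    columns = {name: [r[name] for r in rows] for name in ("time", "code", "method")}
    blocks = {name: msgpack.packb(values) for name, values in columns.items()}

    offsets, body, pos = {}, b"", 0
    for name, blob in blocks.items():
        offsets[name] = (pos, len(blob))
        body += blob
        pos += len(blob)

    header = msgpack.packb({"columns": offsets})
    file_bytes = len(header).to_bytes(4, "big") + header + body

    # Reader: parse the header, then decode only the "code" column block.
    hlen = int.from_bytes(file_bytes[:4], "big")
    meta = msgpack.unpackb(file_bytes[4:4 + hlen], raw=False)
    start, length = meta["columns"]["code"]
    codes = msgpack.unpackb(file_bytes[4 + hlen + start:4 + hlen + start + length], raw=False)
    print(codes)  # [200, 404, 200]
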
23. With that layout, a query such as
SELECT code, COUNT(1) FROM logs
WHERE time >= '2015-12-01 11:00:00'
GROUP BY code
only touches the files whose index range overlaps the WHERE condition, and only the referenced columns inside them (pruning sketch below).
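A small sketch of the time-based pruning step, assuming the hourly metadata rows shown above: for WHERE time >= '2015-12-01 11:00:00', only files whose index range can still contain matching rows are scanned. Paths and ranges are illustrative.

    from datetime import datetime

    metadata = [
        {"path": "archive/2015-12-01/10.mpc",
         "range": ("2015-12-01 10:00:00", "2015-12-01 11:00:00"), "records": 3312},
        {"path": "archive/2015-12-01/11.mpc",
         "range": ("2015-12-01 11:00:00", "2015-12-01 12:00:00"), "records": 2143},
    ]

    def parse(ts: str) -> datetime:
        return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

    def files_for(time_lower_bound: str) -> list:
        """Files that may hold rows with time >= time_lower_bound."""
        bound = parse(time_lower_bound)
        return [m["path"] for m in metadata if parse(m["range"][1]) > bound]

    # WHERE time >= '2015-12-01 11:00:00'  ->  only the 11:00-12:00 file is scanned.
    print(files_for("2015-12-01 11:00:00"))
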
24. Handling Eventual Consistency
1. Write data / metadata first
> At this time, the data is not visible
2. Check whether the data is available
> GET, GET, GET… (polling sketch below)
3. The data becomes visible
> Queries include the imported data!
Ex. Netflix's case: https://github.com/Netflix/s3mper
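Step 2 ("GET, GET, GET…") can be sketched as a polling loop against the object store, here using boto3 against S3: keep HEAD-ing the just-written object until the eventually consistent store actually returns it, and only then publish the metadata that makes it queryable. The bucket and key are placeholders, the snippet needs boto3 and credentials, and real-world checks (or tools like s3mper) are more involved.

    import time

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def wait_until_visible(bucket: str, key: str, attempts: int = 30, delay: float = 1.0) -> bool:
        for _ in range(attempts):
            try:
                s3.head_object(Bucket=bucket, Key=key)  # "GET" the just-written object
                return True                             # visible: safe to publish metadata
            except ClientError as err:
                if err.response["Error"]["Code"] not in ("404", "NoSuchKey"):
                    raise                               # a real failure, not consistency lag
            time.sleep(delay)
        return False

    # wait_until_visible("example-import-bucket", "realtime/2015-12-01/chunk-375828ce.mpc")
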
25. Hide network cost
> Open a lot of connections to Object Storage
> Use the Range feature with columnar offsets
> Improves scan performance for partitioned data
> Detect recoverable errors
> We keep error lists for fault tolerance
> Stall checker
> Watch the progress of reading data
> If processing time reaches a threshold, re-connect to Object Storage and re-read the data
(A Range-read / stall-retry sketch follows.)
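A rough sketch of two of these ideas, assuming plain HTTPS access to the object store: an HTTP Range request fetches a single column block (its offset and length come from the file header), and a read timeout acts as a crude stall checker that drops and re-issues a stalled connection. The URL, offsets, and retry limits are illustrative; the snippet uses the requests package.

    import requests

    def read_column_block(url: str, offset: int, length: int,
                          stall_timeout: float = 10.0, max_attempts: int = 3) -> bytes:
        headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
        for attempt in range(1, max_attempts + 1):
            try:
                # The timeout is a crude stall checker: if the server stops responding
                # for stall_timeout seconds, drop the connection and retry.
                resp = requests.get(url, headers=headers, timeout=stall_timeout)
                resp.raise_for_status()
                return resp.content
            except requests.RequestException:
                if attempt == max_attempts:
                    raise
        raise RuntimeError("unreachable")

    # block = read_column_block("https://example-bucket.s3.amazonaws.com/part-0001.mpc",
    #                           offset=1024, length=4096)
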
26. Optimizing Scan Performance
> Fully utilize the network bandwidth from S3; TD Presto then becomes the CPU bottleneck
> TableScanOperator sends header requests (s3 file list, table schema) to S3 / Riak CS through a Request Queue (priority queue, max connections limit)
> MPC1 file layout: Header, Column Block 0 (column names), Column Block 1, …, Column Block i, …, Column Block m
> HeaderReader calls back to HeaderParser, which parses the MPC file header: column block offsets and column names
> ColumnBlockReader issues column block requests and prepares MessageUnpackers (decompression, msgpack-java v07) that pull records from the S3 reads
> Buffers are size-limited and reused (release(Buffer))
> Retry GET requests on: 500 (internal error), 503 (slow down), 404 (not found), eventual consistency
27. Recoverable errors
> Error types
> User error: syntax error, semantic error
> Insufficient resource: exceeded task memory size
> Internal failure: I/O error of S3 / Riak CS, worker failure, etc.
> We can retry these patterns
28. Recoverable errors (same list as above, shown again; a retry sketch follows).
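The retry policy these two slides imply might look like the sketch below: user errors are raised back to the user (retrying cannot fix a bad query), while insufficient-resource and internal failures are retried with backoff until they succeed, which matches the "queries succeed eventually" point on the next slide. The exception classes and backoff parameters are invented for illustration.

    import time

    class UserError(Exception): pass             # e.g. syntax / semantic error
    class InsufficientResource(Exception): pass  # e.g. exceeded task memory size
    class InternalFailure(Exception): pass       # e.g. S3 / Riak CS I/O error, worker failure

    RETRYABLE = (InsufficientResource, InternalFailure)

    def run_with_retry(run_query, max_attempts: int = 5, base_delay: float = 2.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return run_query()
            except UserError:
                raise                    # retrying cannot fix a bad query
            except RETRYABLE:
                if attempt == max_attempts:
                    raise
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

    # Example: a query that fails twice with an internal error, then succeeds.
    calls = {"n": 0}
    def flaky_query():
        calls["n"] += 1
        if calls["n"] < 3:
            raise InternalFailure("S3 I/O error")
        return "result"

    print(run_with_retry(flaky_query, base_delay=0.1))  # "result" on the third attempt
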
29. Presto retry on Internal Errors
> Queries succeed eventually
(Chart omitted; the y-axis is log scale.)
30. Over time the schema of the incoming data changes: a new "user" column appears, and in one record "code" arrives as the string "200".
time                | code | method
2015-12-01 10:02:36 | 200  | GET
2015-12-01 10:22:09 | 404  | GET
2015-12-01 10:36:45 | 200  | GET
2015-12-01 10:49:21 | 200  | POST
…                   | …    | …
user | time                | code  | method
391  | 2015-12-01 11:10:09 | 200   | GET
482  | 2015-12-01 11:21:45 | 200   | GET
573  | 2015-12-01 11:38:59 | 200   | GET
664  | 2015-12-01 11:43:37 | 200   | GET
755  | 2015-12-01 11:54:52 | "200" | GET
…    | …                   | …     | …
31. (Same tables as above.)
> MessagePack Columnar File Format is schema-less: ✓ instant schema change
> SQL is schema-full: ✓ SQL doesn't work without a schema
> Schema-on-Read
32. Schema-on-Read
Realtime Storage and Archive Storage are schema-less; the Query Engine (Hive, Pig, Presto) is schema-full.
{"user":54, "name":"plazma", "value":"120", "host":"local"}
33. Schema-on-Read: a schema is registered for the table, and records are read through it.
CREATE TABLE events (
  user INT, name STRING, value INT, host INT
);
34. At read time the schema is applied to each schema-less record:
{"user":54, "name":"plazma", "value":"120", "host":"local"}
-> | user | 54 | name | "plazma" | value | 120 | host | NULL |
Schema-on-Read: "value":"120" is coerced to 120, while "host":"local" does not fit INT and becomes NULL (sketch below).
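A minimal Python rendering of that Schema-on-Read step: the stored record stays schema-less, the registered schema is applied only while reading, and values that cannot be coerced to the declared type come back as NULL (None). The coercion rules here are simplified guesses, not Hive's or Presto's exact behavior.

    SCHEMA = {"user": "INT", "name": "STRING", "value": "INT", "host": "INT"}

    def apply_schema(record: dict, schema: dict) -> dict:
        def coerce(value, sql_type):
            if value is None:
                return None
            if sql_type == "INT":
                try:
                    return int(value)        # "120" -> 120
                except (TypeError, ValueError):
                    return None              # "local" -> NULL
            if sql_type == "STRING":
                return str(value)
            return None
        return {col: coerce(record.get(col), typ) for col, typ in schema.items()}

    record = {"user": 54, "name": "plazma", "value": "120", "host": "local"}
    print(apply_schema(record, SCHEMA))
    # {'user': 54, 'name': 'plazma', 'value': 120, 'host': None}
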
35. Fluentd (http://fluentd.org/): streaming logging layer, reliable forwarding, pluggable architecture. A minimal client example follows.
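For context, sending an event into such a pipeline from Python might look like this, assuming a local td-agent/fluentd listening on the default forward port 24224 and the fluent-logger package installed; the tag and record fields are made-up examples.

    from fluent import sender

    logger = sender.FluentSender("app", host="localhost", port=24224)
    if not logger.emit("access", {"code": 200, "method": "GET", "path": "/"}):
        print(logger.last_error)  # delivery problems surface here
    logger.close()
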
36. Embulk (http://embulk.org/): bulk loading, parallel processing, pluggable architecture.
37. Hadoop
> Distributed computing framework
> Consists of many components…
(Component diagram: http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/)
38. Presto
> A distributed SQL query engine for interactive data analysis against GBs to PBs of data
> Open sourced by Facebook
> https://github.com/facebook/presto
39. Conclusion
> Build a scalable data analytics platform on the Cloud
> Separate compute resources and storage
> Loosely-coupled components
> We have lots of useful OSS and services :)
> There are many trade-offs
> Use an existing component or create a new one?
> Stick to the basics!
> If you're tired of building all this yourself, please use Treasure Data ;)
40. https://jobs.lever.co/treasure-data
Cloud service for the entire data pipeline.
