How to create Treasure Data #dotsbigdata
Masahiro Nakagawa
BigData All Stars 2015, August 1, 2015

Published on http://eventdots.jp/event/562221


1. Masahiro Nakagawa, August 1, 2015, BigData All Stars 2015: How to create Treasure Data #dotsbigdata
2. Who are you?
> Masahiro Nakagawa
> github/twitter: @repeatedly
> Treasure Data, Inc.
> Senior Software Engineer
> Fluentd / td-agent developer
> I love OSS :)
> D language - Phobos committer
> Fluentd - Main maintainer
> MessagePack / RPC - D and Python (only RPC)
> The organizer of Presto Source Code Reading / meetup
> etc…
3. Company overview: http://www.treasuredata.com/opensource
4. Treasure Data Solution: Ingest, Analyze, Distribute
5. Treasure Data Service
> A simplified cloud analytics infrastructure
> Customers focus on their business
> SQL interfaces for schema-less data sources
> Fit for Data Hub / Lake
> Batch / Low latency / Machine Learning
> Lots of ingestion and integrated solutions
> Fluentd / Embulk / Data Connector / SDKs
> Result Output / Prestogres Gateway / BI tools
> Awesome support for time to value
6. (Figure slide; no text.)
7. Plazma - TD's distributed analytical database
8. Plazma by the numbers
> Streaming import: 45 billion records / day
> Bulk import: 10 billion records / day
> Hive queries: 3+ trillion records / day (Machine Learning queries with Hivemall increased)
> Presto queries: 3+ trillion records / day
9. TD's resource management
> Guarantee and boost compute resources
> Guarantee for stabilizing query performance
> Boost for sharing free resources
> Get multi-tenant merit
> Global resource scheduler
> Manages jobs, resources, and priority across users
> Separate storage from compute resources
> Easy to scale workers
> We can use S3 / GCS / Azure Storage as a reliable backend
10. Data Importing
11. Import Queue
td-agent / fluentd -> API Server -> Import Queue (MySQL, PerfectQueue) -> Import Worker
✓ Buffering for 5 minutes
✓ Retrying (at-least once)
✓ On-disk buffering on failure
✓ Unique ID for each chunk
Chunks are MessagePack ("It's like JSON, but fast and small."):
unique_id=375828ce5510cadb
{"time":1426047906,"uid":1,…} {"time":1426047912,"uid":9,…} {"time":1426047939,"uid":3,…} {"time":1426047951,"uid":2,…} …

12. Import Queue
Same flow, with a 1 minute buffer; the queue records each chunk's unique_id and upload time:
unique_id        | time
375828ce5510cadb | 2015-12-01 10:47
2024cffb9510cadc | 2015-12-01 11:09
1b8d6a600510cadd | 2015-12-01 11:21
1f06c0aa510caddb | 2015-12-01 11:38

13. Import Queue
Same flow; unique_id is UNIQUE, so a re-uploaded chunk is inserted at most once. Client retries give at-least-once delivery, the UNIQUE constraint gives at-most-once insertion, so each chunk lands exactly once (see the dedup sketch after slide 14).
14. One Import Queue feeds multiple Import Workers: ✓ HA ✓ Load balancing
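A minimal sketch of the dedup logic on slides 11-13: the client may retry an upload any number of times (at-least once), while a UNIQUE / PRIMARY KEY constraint on unique_id makes the enqueue at-most once, so each chunk is imported effectively once. The table name and columns are illustrative, and sqlite3 stands in for MySQL/PerfectQueue to keep the example self-contained; this is not Treasure Data's actual schema.

    import sqlite3
    import uuid

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE import_queue (
            unique_id TEXT PRIMARY KEY,   -- at-most once: duplicates are rejected
            payload   BLOB NOT NULL,
            uploaded  TEXT DEFAULT CURRENT_TIMESTAMP
        )
    """)

    def enqueue_chunk(unique_id: str, payload: bytes) -> bool:
        """Insert a chunk; return False if it was already enqueued."""
        try:
            with db:
                db.execute(
                    "INSERT INTO import_queue (unique_id, payload) VALUES (?, ?)",
                    (unique_id, payload),
                )
            return True
        except sqlite3.IntegrityError:
            return False  # duplicate upload from a client retry: safely ignored

    chunk_id = uuid.uuid4().hex
    print(enqueue_chunk(chunk_id, b"msgpack-bytes"))  # True  (first upload)
    print(enqueue_chunk(chunk_id, b"msgpack-bytes"))  # False (retry is deduplicated)
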
15. Architecture: Import Queue -> Import Workers -> Realtime Storage and Archive Storage (files on Amazon S3 / Basho Riak CS), with metadata on PostgreSQL.
16. Import Workers put chunks into Realtime Storage; the metadata of the records in each file is stored on PostgreSQL:
uploaded time    | file index range                           | records
2015-03-08 10:47 | [2015-12-01 10:47:11, 2015-12-01 10:48:13] | 3
2015-03-08 11:09 | [2015-12-01 11:09:32, 2015-12-01 11:10:35] | 25
2015-03-08 11:38 | [2015-12-01 11:38:43, 2015-12-01 11:40:49] | 14
…                | …                                          | …
17. A Merge Worker (MapReduce) merges Realtime Storage files into Archive Storage every 1 hour; Retrying + Unique (at-least-once + at-most-once) applies here too. Archive Storage metadata on PostgreSQL (merge sketch below):
file index range                           | records
[2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
[2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…                                          | …
18. The PostgreSQL metadata has a GiST (R-tree) index on the "time" range of the files. A query reads a range from Archive Storage if it has already been merged, otherwise from Realtime Storage. (Same Realtime / Archive metadata tables as on the previous slides.)
19. Data Importing
> Scalable & reliable importing
> Fluentd buffers data on disk
> The import queue deduplicates uploaded chunks
> Workers take the chunks and put them into Realtime Storage
> Instant visibility
> Imported data is immediately visible to the query engines
> Background workers merge the files every 1 hour
> Metadata
> The index is built on PostgreSQL using the RANGE type and a GiST index (index sketch below)
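A sketch of that metadata index, assuming PostgreSQL's built-in tsrange type and its GiST support: only files whose time range overlaps the queried range come back from the lookup. The table name, columns, and connection string are placeholders rather than TD's actual metadata schema, and the snippet needs a reachable PostgreSQL plus psycopg2.

    import psycopg2

    conn = psycopg2.connect("dbname=plazma_meta user=plazma")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS archive_files (
                path       text PRIMARY KEY,
                time_range tsrange NOT NULL,   -- [first record time, last record time)
                records    bigint  NOT NULL
            )
        """)
        cur.execute("""
            CREATE INDEX IF NOT EXISTS archive_files_time_idx
                ON archive_files USING gist (time_range)
        """)
        # Find every file that could contain rows between 11:00 and 12:00.
        cur.execute("""
            SELECT path, records FROM archive_files
            WHERE time_range && tsrange('2015-12-01 11:00:00', '2015-12-01 12:00:00')
        """)
        print(cur.fetchall())
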
20. Data processing
21. Archive Storage: files on Amazon S3 / Basho Riak CS in the MessagePack Columnar File Format, metadata on PostgreSQL.
File for [2015-12-01 10:00:00, 2015-12-01 11:00:00]:
time                | code | method
2015-12-01 10:02:36 | 200  | GET
2015-12-01 10:22:09 | 404  | GET
2015-12-01 10:36:45 | 200  | GET
2015-12-01 10:49:21 | 200  | POST
…                   | …    | …
File for [2015-12-01 11:00:00, 2015-12-01 12:00:00]:
time                | code  | method
2015-12-01 11:10:09 | 200   | GET
2015-12-01 11:21:45 | 200   | GET
2015-12-01 11:38:59 | 200   | GET
2015-12-01 11:43:37 | 200   | GET
2015-12-01 11:54:52 | "200" | GET
…                   | …     | …
Metadata on PostgreSQL:
path | index range                                | records
…    | [2015-12-01 10:00:00, 2015-12-01 11:00:00] | 3,312
…    | [2015-12-01 11:00:00, 2015-12-01 12:00:00] | 2,143
…    | …                                          | …
22. The same data is partitioned two ways: time-based partitioning (one file per time range, tracked by the PostgreSQL metadata) and column-based partitioning (a columnar layout inside each file). A toy columnar-layout sketch follows.
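To make the column-based layout concrete, here is a toy columnar file in Python: a MessagePack header records each column's byte offset and length, and each column is stored as its own MessagePack block, so a reader can fetch just the columns a query needs. This is only the idea, not the real MPC1 format described on slide 26; it requires the msgpack package.

    import msgpack

    rows = [
        {"time": "2015-12-01 10:02:36", "code": 200, "method": "GET"},
        {"time": "2015-12-01 10:22:09", "code": 404, "method": "GET"},
        {"time": "2015-12-01 10:36:45", "code": 200, "method": "GET"},
    ]

    # Writer: pack each column separately and remember its (offset, length).
    columns = {name: [r[name] for r in rows] for name in ("time", "code", "method")}
    blocks = {name: msgpack.packb(values) for name, values in columns.items()}

    offsets, body, pos = {}, b"", 0
    for name, blob in blocks.items():
        offsets[name] = (pos, len(blob))
        body += blob
        pos += len(blob)

    header = msgpack.packb({"columns": offsets})
    file_bytes = len(header).to_bytes(4, "big") + header + body

    # Reader: parse the header, then decode only the "code" column block.
    hlen = int.from_bytes(file_bytes[:4], "big")
    meta = msgpack.unpackb(file_bytes[4:4 + hlen], raw=False)
    start, length = meta["columns"]["code"]
    codes = msgpack.unpackb(file_bytes[4 + hlen + start:4 + hlen + start + length], raw=False)
    print(codes)  # [200, 404, 200]
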
23. With that layout, a query such as
SELECT code, COUNT(1) FROM logs
WHERE time >= '2015-12-01 11:00:00'
GROUP BY code
only touches the files whose index range overlaps the WHERE condition, and only the referenced columns inside them (pruning sketch below).
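A small sketch of the time-based pruning step, assuming the hourly metadata rows shown above: for WHERE time >= '2015-12-01 11:00:00', only files whose index range can still contain matching rows are scanned. Paths and ranges are illustrative.

    from datetime import datetime

    metadata = [
        {"path": "archive/2015-12-01/10.mpc",
         "range": ("2015-12-01 10:00:00", "2015-12-01 11:00:00"), "records": 3312},
        {"path": "archive/2015-12-01/11.mpc",
         "range": ("2015-12-01 11:00:00", "2015-12-01 12:00:00"), "records": 2143},
    ]

    def parse(ts: str) -> datetime:
        return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

    def files_for(time_lower_bound: str) -> list:
        """Files that may hold rows with time >= time_lower_bound."""
        bound = parse(time_lower_bound)
        return [m["path"] for m in metadata if parse(m["range"][1]) > bound]

    # WHERE time >= '2015-12-01 11:00:00'  ->  only the 11:00-12:00 file is scanned.
    print(files_for("2015-12-01 11:00:00"))
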
24. Handling Eventual Consistency
1. Write data / metadata first
> At this time, the data is not visible
2. Check whether the data is available
> GET, GET, GET… (polling sketch below)
3. The data becomes visible
> Queries include the imported data!
Ex. Netflix's case: https://github.com/Netflix/s3mper
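Step 2 ("GET, GET, GET…") can be sketched as a polling loop against the object store, here using boto3 against S3: keep HEAD-ing the just-written object until the eventually consistent store actually returns it, and only then publish the metadata that makes it queryable. The bucket and key are placeholders, the snippet needs boto3 and credentials, and real-world checks (or tools like s3mper) are more involved.

    import time

    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3")

    def wait_until_visible(bucket: str, key: str, attempts: int = 30, delay: float = 1.0) -> bool:
        for _ in range(attempts):
            try:
                s3.head_object(Bucket=bucket, Key=key)  # "GET" the just-written object
                return True                             # visible: safe to publish metadata
            except ClientError as err:
                if err.response["Error"]["Code"] not in ("404", "NoSuchKey"):
                    raise                               # a real failure, not consistency lag
            time.sleep(delay)
        return False

    # wait_until_visible("example-import-bucket", "realtime/2015-12-01/chunk-375828ce.mpc")
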
25. Hide network cost
> Open a lot of connections to Object Storage
> Use the Range feature with columnar offsets
> Improves scan performance for partitioned data
> Detect recoverable errors
> We keep error lists for fault tolerance
> Stall checker
> Watch the progress of reading data
> If processing time reaches a threshold, re-connect to Object Storage and re-read the data
(A Range-read / stall-retry sketch follows.)
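A rough sketch of two of these ideas, assuming plain HTTPS access to the object store: an HTTP Range request fetches a single column block (its offset and length come from the file header), and a read timeout acts as a crude stall checker that drops and re-issues a stalled connection. The URL, offsets, and retry limits are illustrative; the snippet uses the requests package.

    import requests

    def read_column_block(url: str, offset: int, length: int,
                          stall_timeout: float = 10.0, max_attempts: int = 3) -> bytes:
        headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
        for attempt in range(1, max_attempts + 1):
            try:
                # The timeout is a crude stall checker: if the server stops responding
                # for stall_timeout seconds, drop the connection and retry.
                resp = requests.get(url, headers=headers, timeout=stall_timeout)
                resp.raise_for_status()
                return resp.content
            except requests.RequestException:
                if attempt == max_attempts:
                    raise
        raise RuntimeError("unreachable")

    # block = read_column_block("https://example-bucket.s3.amazonaws.com/part-0001.mpc",
    #                           offset=1024, length=4096)
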
26. Optimizing Scan Performance
> Fully utilize the network bandwidth from S3; TD Presto then becomes the CPU bottleneck
> TableScanOperator sends header requests (s3 file list, table schema) to S3 / Riak CS through a Request Queue (priority queue, max connections limit)
> MPC1 file layout: Header, Column Block 0 (column names), Column Block 1, …, Column Block i, …, Column Block m
> HeaderReader calls back to HeaderParser, which parses the MPC file header: column block offsets and column names
> ColumnBlockReader issues column block requests and prepares MessageUnpackers (decompression, msgpack-java v07) that pull records from the S3 reads
> Buffers are size-limited and reused (release(Buffer))
> Retry GET requests on: 500 (internal error), 503 (slow down), 404 (not found), eventual consistency
27. Recoverable errors
> Error types
> User error: syntax error, semantic error
> Insufficient resource: exceeded task memory size
> Internal failure: I/O error of S3 / Riak CS, worker failure, etc.
> We can retry these patterns
28. Recoverable errors (same list as above, shown again; a retry sketch follows).
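The retry policy these two slides imply might look like the sketch below: user errors are raised back to the user (retrying cannot fix a bad query), while insufficient-resource and internal failures are retried with backoff until they succeed, which matches the "queries succeed eventually" point on the next slide. The exception classes and backoff parameters are invented for illustration.

    import time

    class UserError(Exception): pass             # e.g. syntax / semantic error
    class InsufficientResource(Exception): pass  # e.g. exceeded task memory size
    class InternalFailure(Exception): pass       # e.g. S3 / Riak CS I/O error, worker failure

    RETRYABLE = (InsufficientResource, InternalFailure)

    def run_with_retry(run_query, max_attempts: int = 5, base_delay: float = 2.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return run_query()
            except UserError:
                raise                    # retrying cannot fix a bad query
            except RETRYABLE:
                if attempt == max_attempts:
                    raise
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

    # Example: a query that fails twice with an internal error, then succeeds.
    calls = {"n": 0}
    def flaky_query():
        calls["n"] += 1
        if calls["n"] < 3:
            raise InternalFailure("S3 I/O error")
        return "result"

    print(run_with_retry(flaky_query, base_delay=0.1))  # "result" on the third attempt
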
29. Presto retry on Internal Errors
> Queries succeed eventually
(Chart omitted; the y-axis is log scale.)
30. Over time the schema of the incoming data changes: a new "user" column appears, and in one record "code" arrives as the string "200".
time                | code | method
2015-12-01 10:02:36 | 200  | GET
2015-12-01 10:22:09 | 404  | GET
2015-12-01 10:36:45 | 200  | GET
2015-12-01 10:49:21 | 200  | POST
…                   | …    | …
user | time                | code  | method
391  | 2015-12-01 11:10:09 | 200   | GET
482  | 2015-12-01 11:21:45 | 200   | GET
573  | 2015-12-01 11:38:59 | 200   | GET
664  | 2015-12-01 11:43:37 | 200   | GET
755  | 2015-12-01 11:54:52 | "200" | GET
…    | …                   | …     | …
31. (Same tables as above.)
> MessagePack Columnar File Format is schema-less: ✓ instant schema change
> SQL is schema-full: ✓ SQL doesn't work without a schema
> Schema-on-Read
32. Schema-on-Read
Realtime Storage and Archive Storage are schema-less; the Query Engine (Hive, Pig, Presto) is schema-full.
{"user":54, "name":"plazma", "value":"120", "host":"local"}
33. Schema-on-Read: a schema is registered for the table, and records are read through it.
CREATE TABLE events (
  user INT, name STRING, value INT, host INT
);
34. At read time the schema is applied to each schema-less record:
{"user":54, "name":"plazma", "value":"120", "host":"local"}
-> | user | 54 | name | "plazma" | value | 120 | host | NULL |
Schema-on-Read: "value":"120" is coerced to 120, while "host":"local" does not fit INT and becomes NULL (sketch below).
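A minimal Python rendering of that Schema-on-Read step: the stored record stays schema-less, the registered schema is applied only while reading, and values that cannot be coerced to the declared type come back as NULL (None). The coercion rules here are simplified guesses, not Hive's or Presto's exact behavior.

    SCHEMA = {"user": "INT", "name": "STRING", "value": "INT", "host": "INT"}

    def apply_schema(record: dict, schema: dict) -> dict:
        def coerce(value, sql_type):
            if value is None:
                return None
            if sql_type == "INT":
                try:
                    return int(value)        # "120" -> 120
                except (TypeError, ValueError):
                    return None              # "local" -> NULL
            if sql_type == "STRING":
                return str(value)
            return None
        return {col: coerce(record.get(col), typ) for col, typ in schema.items()}

    record = {"user": 54, "name": "plazma", "value": "120", "host": "local"}
    print(apply_schema(record, SCHEMA))
    # {'user': 54, 'name': 'plazma', 'value': 120, 'host': None}
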
35. Fluentd (http://fluentd.org/): streaming logging layer, reliable forwarding, pluggable architecture. A minimal client example follows.
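For context, sending an event into such a pipeline from Python might look like this, assuming a local td-agent/fluentd listening on the default forward port 24224 and the fluent-logger package installed; the tag and record fields are made-up examples.

    from fluent import sender

    logger = sender.FluentSender("app", host="localhost", port=24224)
    if not logger.emit("access", {"code": 200, "method": "GET", "path": "/"}):
        print(logger.last_error)  # delivery problems surface here
    logger.close()
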
36. Embulk (http://embulk.org/): bulk loading, parallel processing, pluggable architecture.
37. Hadoop
> Distributed computing framework
> Consists of many components…
(Component diagram: http://hortonworks.com/hadoop-tutorial/introducing-apache-hadoop-developers/)
38. Presto
> A distributed SQL query engine for interactive data analysis against GBs to PBs of data
> Open sourced by Facebook
> https://github.com/facebook/presto
39. Conclusion
> Build a scalable data analytics platform on the Cloud
> Separate compute resources and storage
> Loosely-coupled components
> We have lots of useful OSS and services :)
> There are many trade-offs
> Use an existing component or create a new one?
> Stick to the basics!
> If you're tired of building all this yourself, please use Treasure Data ;)
40. https://jobs.lever.co/treasure-data
Cloud service for the entire data pipeline.
