Data Analytics Service Company and Its Ruby Usage

Data Analytics Service Company
and Its Ruby Usage
EuRuKo 2015 (Oct 17, 2015)
Satoshi Tagomori (@tagomoris)

Satoshi "Moris" Tagomori
(@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, ...
Treasure Data, Inc.

Data Analytics Platform
Data Analytics Service

Data Analytics Flow
Collect Store Process Visualize
Data source
Reporting
Monitoring

• Data collection, storage
• Console & API endpoints
• Schema management
• Processing (batch, query, ...)
• Queuing & Scheduling
• Data connector/exporter

• Data collection, storage: Ruby(OSS), Java/JRuby(OSS)
• Console & API endpoints: Ruby(RoR)
• Schema management: Ruby/Java (MessagePack)
• Processing (batch, query, ...): Java(Hadoop,Presto)
• Queuing & Scheduling: Ruby(OSS)
• Data connector/exporter: Java, Java/JRuby(OSS)

Treasure Data Architecture: Overview
Console
API
EventCollector
PlazmaDB
Worker
Scheduler
Hadoop
Cluster
Presto
Cluster
USERS
TD SDKs
SERVERS
DataConnector
CUSTOMER's
SYSTEMS

OSS products
• To make logging more easy & simple than ever!
• Plugin system
• Open development
• For various environment/usage
• Fluentd, Fluent-Bit, Embulk
• Fluent-Bit: Data collector for Embedded Linux
http://ﬂuentbit.io/

http://www.fluentd.org/
Fluentd
Unified Logging Layer
For Stream Data
Written in CRuby
http://www.slideshare.net/treasure-data/the-basics-of-fluentd-35681111

Bulk Data Loader
High Throughput&Reliability
Embulk
Written in Java/JRuby
http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
http://www.embulk.org/

HDFS
MySQL
Amazon S3
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution
✓ Data validation
✓ Error recovery
✓ Deterministic behavior
✓ Idempotent retrying
Plugins Plugins
bulk load

Console/API
• RoR + AWS RDS + AngularJS
• on EC2 (API) and Heroku (Console)
• Operation, Conﬁguration & Managing Data

Collecting Data
• Import over Console/API
• From browsers and CLI (TD toolbelt)
• Treasure Agent (rpm/deb)
• Fluentd packaged by Treasure Data
• Post from JavaScript/iOS/Android SDK
• To EventCollector (HTTP endpoint for SDKs, impl. w/ Fluentd)

DataConnector
• Data bulk loader for various data sources
• Load customers' data to Treasure Data
• S3, Redshift, MySQL, PostgreSQL, Salesforce, ...
• Hosted Embulk
• Much computing resources
• Distributed execution on Hadoop MapReduce

Hadoop, Presto clusters
• Some Hadoop/Presto clusters
• We're OSS products itself, not customized one
• with minimal patches for storage I/O

Queue/Worker, Scheduler
• Treasure Data: multi-tenant data analytics service
• executes many jobs in shared clusters (queries,
imports, ...)
• CORE: queues-workers & schedulers
• Clusters have queues/scheduler... it's not enough
• resource limitations for each price plans
• priority queues for job types
• and many others

PerfectQueue
https://github.com/treasure-data/perfectqueue

PerfectQueue
• Highly available distributed queue using RDBMS
• Written in CRuby
• Enqueue by INSERT INTO
• Dequeue/Commit by UPDATE
• Flexible scheduling rather than scalability
• Using Amazon RDS (MySQL) internally
• + Workers on EC2

PerfectSched
https://github.com/treasure-data/perfectsched

PerfectSched
• Highly available distributed scheduler using RDBMS
• Written in CRuby
• At-least-one semantics
• PerfectSched enqueues jobs into PerfectQueue

Storage, Schema
• Another core technology for Treasure Data service
• High performance, schema on read, less cost
• columnar ﬁle format
• high throughput & high concurrency
• compression
• Less schema management
• for customers

PlazmaDB
http://www.slideshare.net/treasure-data/td-techplazma

PlazmaDB
• Distributed database using RDBMS & Distributed FS
• metadata on RDBMS, data chunks on DFS
• Amazon RDS(PostgreSQL) + Amazon S3 / Riak CS
• High throughput & high availability by S3
• Columnar format based on MessagePack
• time based chunking for time series data

Monitoring
• Using DataDog for internal operations
• Monitoring for our customers required:
• How many records are they importing?
• How many jobs are they executing?
• How many threads/processes is a job consuming?

PerfectMonitor
• Is still under construction :P
• Fluentd based metrics collection
• Detailed metric for real-time, summarized for past
• Real-time metric storage using InﬂuxDB
• Historic metric storage using Treasure Data
• Real-time data series are disposable :D
• Potential next OSS product from Treasure Data

For Further improvement
• More performance for more customers
• Dynamic scaling for better performance and less
cost
• New analytics features for brand new experience

"Done is better than Perfect."

We'll improve our code step by step,
with improvements of ruby and developer
community <3
Thanks!

Data Analytics Service Company and Its Ruby Usage

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Data Analytics Service Company and Its Ruby Usage

Similar to Data Analytics Service Company and Its Ruby Usage (20)

More from SATOSHI TAGOMORI

More from SATOSHI TAGOMORI (20)

Recently uploaded

Recently uploaded (20)

Data Analytics Service Company and Its Ruby Usage