Data Analytics Service Company and Its Ruby Usage
EuRuKo 2015 (Oct 17, 2015)
Satoshi Tagomori (@tagomoris)
Satoshi "Moris" Tagomori
(@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, ...
Treasure Data, Inc.
HQ Branch
http://www.treasuredata.com/
Data Analytics Platform
Data Analytics Service
Services
Services
JVM
Services
JVMC++
Data Analytics Flow
Collect Store Process Visualize
Data source
Reporting
Monitoring
Data Analytics Flow
Collect Store Process Visualize
Data source
Reporting
Monitoring
Data Analytics Platform
• Data collection, storage
• Console & API endpoints
• Schema management
• Processing (batch, quer...
Treasure Data Internals
Data Analytics Platform
• Data collection, storage: Ruby(OSS), Java/JRuby(OSS)
• Console & API endpoints: Ruby(RoR)
• Sche...
Treasure Data Architecture: Overview
Console
API
EventCollector
PlazmaDB
Worker
Scheduler
Hadoop
Cluster
Presto
Cluster
US...
OSS products
• To make logging more easy & simple than ever!
• Plugin system
• Open development
• For various environment/...
http://www.fluentd.org/
Fluentd
Unified Logging Layer
For Stream Data
Written in CRuby
http://www.slideshare.net/treasure-da...
Bulk Data Loader
High Throughput&Reliability
Embulk
Written in Java/JRuby
http://www.slideshare.net/frsyuki/embuk-making-d...
HDFS
MySQL
Amazon S3
CSV Files
SequenceFile
Salesforce.com
Elasticsearch
Cassandra
Hive
Redis
✓ Parallel execution
✓ Data ...
Treasure Data Architecture: Overview
Console
API
EventCollector
PlazmaDB
Worker
Scheduler
Hadoop
Cluster
Presto
Cluster
US...
Console/API
• RoR + AWS RDS + AngularJS
• on EC2 (API) and Heroku (Console)
• Operation, Configuration & Managing Data
Treasure Data Architecture: Overview
Console
API
EventCollector
PlazmaDB
Worker
Scheduler
Hadoop
Cluster
Presto
Cluster
US...
Collecting Data
• Import over Console/API
• From browsers and CLI (TD toolbelt)
• Treasure Agent (rpm/deb)
• Fluentd packa...
Treasure Data Architecture: Overview
Console
API
EventCollector
PlazmaDB
Worker
Scheduler
Hadoop
Cluster
Presto
Cluster
US...
DataConnector
• Data bulk loader for various data sources
• Load customers' data to Treasure Data
• S3, Redshift, MySQL, P...
Treasure Data Architecture: Overview
Console
API
EventCollector
PlazmaDB
Worker
Scheduler
Hadoop
Cluster
Presto
Cluster
US...
Hadoop, Presto clusters
• Some Hadoop/Presto clusters
• We're OSS products itself, not customized one
• with minimal patch...
Treasure Data Architecture: Overview
Console
API
EventCollector
PlazmaDB
Worker
Scheduler
Hadoop
Cluster
Presto
Cluster
US...
Queue/Worker, Scheduler
• Treasure Data: multi-tenant data analytics service
• executes many jobs in shared clusters (quer...
PerfectQueue
https://github.com/treasure-data/perfectqueue
PerfectQueue
• Highly available distributed queue using RDBMS
• Written in CRuby
• Enqueue by INSERT INTO
• Dequeue/Commit...
PerfectSched
https://github.com/treasure-data/perfectsched
PerfectSched
• Highly available distributed scheduler using RDBMS
• Written in CRuby
• At-least-one semantics
• PerfectSch...
Storage, Schema
• Another core technology for Treasure Data service
• High performance, schema on read, less cost
• column...
Treasure Data Architecture: Overview
Console
API
EventCollector
PlazmaDB
Worker
Scheduler
Hadoop
Cluster
Presto
Cluster
US...
PlazmaDB
http://www.slideshare.net/treasure-data/td-techplazma
PlazmaDB
• Distributed database using RDBMS & Distributed FS
• metadata on RDBMS, data chunks on DFS
• Amazon RDS(PostgreS...
Monitoring
• Using DataDog for internal operations
• Monitoring for our customers required:
• How many records are they im...
PerfectMonitor
PerfectMonitor
• Is still under construction :P
• Fluentd based metrics collection
• Detailed metric for real-time, summar...
For Further improvement
• More performance for more customers
• Dynamic scaling for better performance and less
cost
• New...
"Done is better than Perfect."
DoneQueue?
We'll improve our code step by step,
with improvements of ruby and developer
community <3
Thanks!
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
Upcoming SlideShare
Loading in …5
×

Data Analytics Service Company and Its Ruby Usage

7,111 views

Published on

euruko2015 presentation

Published in: Technology

Data Analytics Service Company and Its Ruby Usage

1. Data Analytics Service Company and Its Ruby Usage
   EuRuKo 2015 (Oct 17, 2015)
   Satoshi Tagomori (@tagomoris)
2. Satoshi "Moris" Tagomori (@tagomoris)
   Fluentd, MessagePack-Ruby, Norikra, ...
   Treasure Data, Inc.
3. HQ / Branch
4. http://www.treasuredata.com/
5. Data Analytics Platform / Data Analytics Service
6. Services (diagram)
7. Services (diagram: JVM)
8. Services (diagram: JVM, C++)
9. Data Analytics Flow
   Collect → Store → Process → Visualize
   Data source / Reporting / Monitoring
10. Data Analytics Flow
    Collect → Store → Process → Visualize
    Data source / Reporting / Monitoring
11. Data Analytics Platform
    • Data collection, storage
    • Console & API endpoints
    • Schema management
    • Processing (batch, query, ...)
    • Queuing & Scheduling
    • Data connector/exporter
12. Treasure Data Internals
13. Data Analytics Platform
    • Data collection, storage: Ruby (OSS), Java/JRuby (OSS)
    • Console & API endpoints: Ruby (RoR)
    • Schema management: Ruby/Java (MessagePack)
    • Processing (batch, query, ...): Java (Hadoop, Presto)
    • Queuing & Scheduling: Ruby (OSS)
    • Data connector/exporter: Java, Java/JRuby (OSS)
14. Treasure Data Architecture: Overview
    (diagram: Console, API, EventCollector, PlazmaDB, Worker, Scheduler, Hadoop Cluster, Presto Cluster; USERS, TD SDKs, SERVERS, and CUSTOMER's SYSTEMS via DataConnector)
15. OSS products
    • To make logging easier & simpler than ever!
    • Plugin system
    • Open development
    • For various environments/usages
    • Fluentd, Fluent-Bit, Embulk
      • Fluent-Bit: data collector for Embedded Linux
    http://fluentbit.io/
16. Fluentd: Unified Logging Layer for Stream Data
    Written in CRuby
    http://www.fluentd.org/
    http://www.slideshare.net/treasure-data/the-basics-of-fluentd-35681111
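As a minimal sketch of what the "unified logging layer" looks like from application code, here is a hedged Ruby example using the fluent-logger gem; the tag and record fields are made up, and a local Fluentd with an in_forward input on port 24224 is assumed.

```ruby
# Minimal sketch, not from the talk: emit one event to a local Fluentd.
# Assumes Fluentd is listening with an in_forward input on port 24224.
require 'fluent-logger'

# Connect once; nil means no tag prefix.
Fluent::Logger::FluentLogger.open(nil, host: 'localhost', port: 24224)

# Post an event; Fluentd routes it to outputs by its tag ("myapp.access").
Fluent::Logger.post('myapp.access', 'agent' => 'curl', 'status' => 200)
```

From there, routing the stream to files, S3, Treasure Data, etc. is a matter of Fluentd output-plugin configuration rather than application code.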
17. Embulk: Bulk Data Loader, High Throughput & Reliability
    Written in Java/JRuby
    http://www.embulk.org/
    http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed
18. (diagram: bulk load between HDFS, MySQL, Amazon S3, CSV Files, SequenceFile, Salesforce.com, Elasticsearch, Cassandra, Hive, Redis, ... via plugins)
    ✓ Parallel execution
    ✓ Data validation
    ✓ Error recovery
    ✓ Deterministic behavior
    ✓ Idempotent retrying
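To give a flavor of how an Embulk bulk-load job is described, here is a hedged sketch of a config file; the paths, parser options, and plugin choices are illustrative assumptions, not settings from the talk.

```yaml
# Hypothetical Embulk config: read local CSV files and print the records.
# Real jobs would swap in input/output plugins such as s3, mysql, redshift.
in:
  type: file
  path_prefix: ./csv/sample_    # placeholder path
  parser:
    type: csv
    charset: UTF-8
out:
  type: stdout
```

Typically one runs "embulk guess seed.yml -o config.yml" to infer the schema and "embulk run config.yml" to execute the load.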
19. Treasure Data Architecture: Overview (same diagram as slide 14)
20. Console/API
    • RoR + AWS RDS + AngularJS
    • on EC2 (API) and Heroku (Console)
    • Operation, Configuration & Managing Data
21. Treasure Data Architecture: Overview (same diagram as slide 14)
22. Collecting Data
    • Import over Console/API
      • From browsers and CLI (TD toolbelt)
    • Treasure Agent (rpm/deb)
      • Fluentd packaged by Treasure Data
    • Post from JavaScript/iOS/Android SDK
      • To EventCollector (HTTP endpoint for SDKs, impl. w/ Fluentd)
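As a rough illustration of the SDK-to-EventCollector path, here is a hedged Ruby sketch that POSTs one JSON event over HTTP; the endpoint URL, header name, and payload shape are hypothetical, not the real EventCollector API.

```ruby
# Hypothetical sketch only: the real path is the JS/iOS/Android SDKs posting
# to EventCollector; the URL, header, and payload here are invented.
require 'net/http'
require 'json'
require 'uri'

uri = URI('https://collector.example.com/event')   # hypothetical endpoint
event = { db: 'web_logs', table: 'pageviews',
          record: { path: '/index', status: 200, time: Time.now.to_i } }

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true
req = Net::HTTP::Post.new(uri.path,
                          'Content-Type' => 'application/json',
                          'X-Write-Key'  => 'YOUR_API_KEY')  # hypothetical header
req.body = event.to_json
puts http.request(req).code
```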
23. Treasure Data Architecture: Overview (same diagram as slide 14)
24. DataConnector
    • Bulk data loader for various data sources
    • Loads customers' data into Treasure Data
      • S3, Redshift, MySQL, PostgreSQL, Salesforce, ...
    • Hosted Embulk
      • Needs a lot of computing resources
      • Distributed execution on Hadoop MapReduce
25. Treasure Data Architecture: Overview (same diagram as slide 14)
26. Hadoop, Presto clusters
    • Several Hadoop/Presto clusters
    • We run the OSS products as-is, not customized forks
      • with minimal patches for storage I/O
27. Treasure Data Architecture: Overview (same diagram as slide 14)
28. Queue/Worker, Scheduler
    • Treasure Data: multi-tenant data analytics service
      • executes many jobs in shared clusters (queries, imports, ...)
    • CORE: queues-workers & schedulers
    • Clusters have queues/schedulers... but that's not enough:
      • resource limits for each price plan
      • priority queues per job type
      • and many others
29. PerfectQueue
    https://github.com/treasure-data/perfectqueue
30. PerfectQueue
    • Highly available distributed queue using an RDBMS
    • Written in CRuby
    • Enqueue by INSERT INTO
    • Dequeue/Commit by UPDATE
    • Flexible scheduling rather than scalability
    • Using Amazon RDS (MySQL) internally
      • + Workers on EC2
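The "enqueue by INSERT, dequeue/commit by UPDATE" idea can be sketched in Ruby with raw SQL as below; the table and column names are invented for illustration and do not match PerfectQueue's actual schema.

```ruby
# Rough sketch of an RDBMS-backed queue in the spirit of PerfectQueue.
# Table/column names are invented; PerfectQueue's real schema differs.
require 'mysql2'

db = Mysql2::Client.new(host: 'localhost', username: 'worker', database: 'queue_db')

# Enqueue: a task is just an INSERTed row.
db.query("INSERT INTO tasks (id, data) VALUES ('job-123', '{\"type\":\"hive\"}')")

# Dequeue: atomically claim one runnable task by UPDATE with a lease;
# a crashed worker's lease expires and the task becomes claimable again.
db.query("UPDATE tasks
          SET owner = 'worker-1', lease_until = NOW() + INTERVAL 300 SECOND
          WHERE finished_at IS NULL AND (owner IS NULL OR lease_until < NOW())
          ORDER BY id LIMIT 1")
# (a real worker would then SELECT the row it just claimed)

# Commit: finishing the task is also an UPDATE.
db.query("UPDATE tasks SET finished_at = NOW()
          WHERE id = 'job-123' AND owner = 'worker-1'")
```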
31. PerfectSched
    https://github.com/treasure-data/perfectsched
32. PerfectSched
    • Highly available distributed scheduler using an RDBMS
    • Written in CRuby
    • At-least-once semantics
    • PerfectSched enqueues jobs into PerfectQueue
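A toy sketch of that scheduler-feeds-queue relationship follows, with at-least-once behavior coming from enqueueing before advancing the schedule; the Schedule struct and enqueue_task helper are placeholders, not PerfectSched's real API.

```ruby
# Toy sketch, not PerfectSched's API: fire due schedules and enqueue tasks.
# Enqueue happens before next_run is advanced, so a crash in between may
# produce a duplicate run but never a missed one (at-least-once).
Schedule = Struct.new(:name, :interval, :next_run)

def enqueue_task(name)
  puts "enqueue #{name} at #{Time.now}"   # stand-in for a PerfectQueue submit
end

schedules = [Schedule.new('hourly_report', 3600, Time.now)]

loop do
  now = Time.now
  schedules.each do |s|
    next if s.next_run > now
    enqueue_task(s.name)        # 1) enqueue the job
    s.next_run += s.interval    # 2) then advance the schedule
  end
  sleep 1
end
```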
33. Storage, Schema
    • Another core technology for the Treasure Data service
    • High performance, schema on read, less cost
      • columnar file format
      • high throughput & high concurrency
      • compression
    • Less schema management for customers
34. Treasure Data Architecture: Overview (same diagram as slide 14)
35. PlazmaDB
    http://www.slideshare.net/treasure-data/td-techplazma
36. PlazmaDB
    • Distributed database using RDBMS & Distributed FS
      • metadata on RDBMS, data chunks on DFS
      • Amazon RDS (PostgreSQL) + Amazon S3 / Riak CS
    • High throughput & high availability by S3
    • Columnar format based on MessagePack
      • time-based chunking for time-series data
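To make "columnar format based on MessagePack" and "time-based chunking" a bit more concrete, here is a hedged Ruby sketch using the msgpack gem; the chunk layout and key naming are invented and do not reflect PlazmaDB's actual format.

```ruby
# Invented illustration of MessagePack columnar chunks with hourly
# (time-based) partitioning; not PlazmaDB's real on-disk layout.
require 'msgpack'

rows = [
  { 'time' => 1445050000, 'path' => '/index', 'status' => 200 },
  { 'time' => 1445050100, 'path' => '/buy',   'status' => 302 },
]

# Columnar: pack each column's values as one MessagePack array.
columns = {}
rows.first.keys.each do |col|
  columns[col] = MessagePack.pack(rows.map { |r| r[col] })
end

# Time-based chunking: bucket the chunk by the hour of its first record,
# as if writing it to an object store (e.g. S3) under that key.
bucket = Time.at(rows.first['time']).utc.strftime('%Y%m%d-%H')
columns.each do |name, packed|
  puts "chunks/#{bucket}/#{name}.msgpack (#{packed.bytesize} bytes)"
end
```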
37. Monitoring
    • Using DataDog for internal operations
    • Monitoring for our customers requires knowing:
      • How many records are they importing?
      • How many jobs are they executing?
      • How many threads/processes is a job consuming?
38. PerfectMonitor
39. PerfectMonitor
    • Still under construction :P
    • Fluentd-based metrics collection
    • Detailed metrics for real-time, summarized metrics for the past
      • Real-time metric storage using InfluxDB
      • Historical metric storage using Treasure Data
      • Real-time data series are disposable :D
    • Potential next OSS product from Treasure Data
40. For Further Improvement
    • More performance for more customers
    • Dynamic scaling for better performance and less cost
    • New analytics features for a brand new experience
41. "Done is better than Perfect."
42. DoneQueue?
43. We'll improve our code step by step,
    with improvements of ruby and developer community <3
    Thanks!
