Data Analytics Service Company and Its Ruby Usage

  1. Data Analytics Service Company and Its Ruby Usage EuRuKo 2015 (Oct 17, 2015) Satoshi Tagomori (@tagomoris)
  2. Satoshi "Moris" Tagomori (@tagomoris) Fluentd, MessagePack-Ruby, Norikra, ... Treasure Data, Inc.
  3. HQ Branch
  4. http://www.treasuredata.com/
  5. Data Analytics Platform Data Analytics Service
  6. Services
  7. Services JVM
  8. Services JVM C++
  9. Data Analytics Flow Collect Store Process Visualize Data source Reporting Monitoring
  10. Data Analytics Flow Collect Store Process Visualize Data source Reporting Monitoring
  11. Data Analytics Platform • Data collection, storage • Console & API endpoints • Schema management • Processing (batch, query, ...) • Queuing & Scheduling • Data connector/exporter
  12. Treasure Data Internals
  13. Data Analytics Platform • Data collection, storage: Ruby(OSS), Java/JRuby(OSS) • Console & API endpoints: Ruby(RoR) • Schema management: Ruby/Java (MessagePack) • Processing (batch, query, ...): Java(Hadoop,Presto) • Queuing & Scheduling: Ruby(OSS) • Data connector/exporter: Java, Java/JRuby(OSS)
  14. Treasure Data Architecture: Overview Console API EventCollector PlazmaDB Worker Scheduler Hadoop Cluster Presto Cluster USERS TD SDKs SERVERS DataConnector CUSTOMER's SYSTEMS
  15. OSS products • To make logging easier & simpler than ever! • Plugin system • Open development • For various environments/usages • Fluentd, Fluent-Bit, Embulk • Fluent-Bit: Data collector for Embedded Linux http://fluentbit.io/
  16. http://www.fluentd.org/ Fluentd Unified Logging Layer For Stream Data Written in CRuby http://www.slideshare.net/treasure-data/the-basics-of-fluentd-35681111
  17. Bulk Data Loader High Throughput & Reliability Embulk Written in Java/JRuby http://www.slideshare.net/frsyuki/embuk-making-data-integration-works-relaxed http://www.embulk.org/
  18. HDFS MySQL Amazon S3 CSV Files SequenceFile Salesforce.com Elasticsearch Cassandra Hive Redis ✓ Parallel execution ✓ Data validation ✓ Error recovery ✓ Deterministic behavior ✓ Idempotent retrying Plugins Plugins bulk load
  19. Treasure Data Architecture: Overview Console API EventCollector PlazmaDB Worker Scheduler Hadoop Cluster Presto Cluster USERS TD SDKs SERVERS DataConnector CUSTOMER's SYSTEMS
  20. Console/API • RoR + AWS RDS + AngularJS • on EC2 (API) and Heroku (Console) • Operation, Configuration & Managing Data
  21. Treasure Data Architecture: Overview Console API EventCollector PlazmaDB Worker Scheduler Hadoop Cluster Presto Cluster USERS TD SDKs SERVERS DataConnector CUSTOMER's SYSTEMS
  22. Collecting Data • Import over Console/API • From browsers and CLI (TD toolbelt) • Treasure Agent (rpm/deb) • Fluentd packaged by Treasure Data • Post from JavaScript/iOS/Android SDK • To EventCollector (HTTP endpoint for SDKs, impl. w/ Fluentd)
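To make the collection path above concrete, here is a minimal sketch of posting one event from a Ruby application through a locally running Treasure Agent (Fluentd), using the fluent-logger gem; the database/table names and the forward port are assumptions for illustration, not taken from the slides:

```ruby
require 'fluent-logger'

# Connect to a local Treasure Agent "forward" input; host/port are assumptions.
log = Fluent::Logger::FluentLogger.new(nil, host: 'localhost', port: 24224)

# Post one event; the agent buffers it locally and uploads it to Treasure Data
# in the background. 'example_db.events' is a hypothetical database.table tag.
log.post('example_db.events', user_id: 42, action: 'click', path: '/pricing')
```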
  23. Treasure Data Architecture: Overview Console API EventCollector PlazmaDB Worker Scheduler Hadoop Cluster Presto Cluster USERS TD SDKs SERVERS DataConnector CUSTOMER's SYSTEMS
  24. DataConnector • Data bulk loader for various data sources • Load customers' data to Treasure Data • S3, Redshift, MySQL, PostgreSQL, Salesforce, ... • Hosted Embulk • Needs a lot of computing resources • Distributed execution on Hadoop MapReduce
  25. Treasure Data Architecture: Overview Console API EventCollector PlazmaDB Worker Scheduler Hadoop Cluster Presto Cluster USERS TD SDKs SERVERS DataConnector CUSTOMER's SYSTEMS
  26. Hadoop, Presto clusters • Some Hadoop/Presto clusters • We run the OSS products themselves, not customized ones • with minimal patches for storage I/O
  27. Treasure Data Architecture: Overview Console API EventCollector PlazmaDB Worker Scheduler Hadoop Cluster Presto Cluster USERS TD SDKs SERVERS DataConnector CUSTOMER's SYSTEMS
  28. Queue/Worker, Scheduler • Treasure Data: multi-tenant data analytics service • executes many jobs in shared clusters (queries, imports, ...) • CORE: queues-workers & schedulers • Clusters have queues/schedulers... but it's not enough: • resource limitations for each price plan • priority queues for job types • and many others
  29. PerfectQueue https://github.com/treasure-data/perfectqueue
  30. PerfectQueue • Highly available distributed queue using RDBMS • Written in CRuby • Enqueue by INSERT INTO • Dequeue/Commit by UPDATE • Flexible scheduling rather than scalability • Using Amazon RDS (MySQL) internally • + Workers on EC2
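The "enqueue by INSERT INTO, dequeue/commit by UPDATE" idea can be sketched roughly as below. This is a simplified illustration using the mysql2 gem and a hypothetical tasks table, not PerfectQueue's actual schema or API:

```ruby
require 'mysql2'

# One RDBMS table acts as the queue: enqueue is an INSERT, claiming and
# committing a task are UPDATEs, so MySQL provides durability and mutual
# exclusion between workers. Connection settings are placeholders.
db = Mysql2::Client.new(host: 'localhost', username: 'td', database: 'queue_db')

def enqueue(db, key, payload)
  db.prepare('INSERT INTO tasks (task_key, payload, owner, lease_until, finished) VALUES (?, ?, NULL, NULL, 0)')
    .execute(key, payload)
end

def dequeue(db, worker_id, lease_sec = 300)
  # Claim one runnable task by taking ownership and a lease deadline.
  # Tasks whose lease expired become claimable again: at-least-once execution.
  db.prepare(<<~SQL).execute(worker_id, lease_sec)
    UPDATE tasks
       SET owner = ?, lease_until = NOW() + INTERVAL ? SECOND
     WHERE finished = 0 AND (owner IS NULL OR lease_until < NOW())
     LIMIT 1
  SQL
  # Simplified: assume this worker holds at most one unfinished task.
  db.prepare('SELECT * FROM tasks WHERE owner = ? AND finished = 0 LIMIT 1')
    .execute(worker_id).first
end

def commit(db, task_id)
  # Mark the task done instead of deleting it, so retries stay idempotent.
  db.prepare('UPDATE tasks SET finished = 1 WHERE id = ?').execute(task_id)
end
```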
  31. PerfectSched https://github.com/treasure-data/perfectsched
  32. PerfectSched • Highly available distributed scheduler using RDBMS • Written in CRuby • At-least-once semantics • PerfectSched enqueues jobs into PerfectQueue
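A rough sketch of how at-least-once scheduling can work on top of such a queue (hypothetical schedules table, reusing the hypothetical enqueue helper from the queue sketch above; not PerfectSched's real code): the job is enqueued first and the schedule advanced only afterwards, so a crash in between produces a duplicate run rather than a missed one.

```ruby
# Fire every schedule whose next_time has passed: hand the job to the queue,
# then advance next_time. If the scheduler dies between the two steps, the
# same run is enqueued again after recovery -- at least once, never zero times.
def run_schedules(db)
  due = db.query('SELECT id, job_key, period_sec FROM schedules WHERE next_time <= NOW()')
  due.each do |s|
    enqueue(db, "#{s['job_key']}/#{Time.now.to_i}", '{}')   # enqueue into the queue first
    db.prepare('UPDATE schedules SET next_time = NOW() + INTERVAL ? SECOND WHERE id = ?')
      .execute(s['period_sec'], s['id'])                    # advance only after the enqueue succeeded
  end
end
```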
  33. Storage, Schema • Another core technology for the Treasure Data service • High performance, schema on read, less cost • columnar file format • high throughput & high concurrency • compression • Less schema management • for customers
  34. Treasure Data Architecture: Overview Console API EventCollector PlazmaDB Worker Scheduler Hadoop Cluster Presto Cluster USERS TD SDKs SERVERS DataConnector CUSTOMER's SYSTEMS
  35. PlazmaDB http://www.slideshare.net/treasure-data/td-techplazma
  36. PlazmaDB • Distributed database using RDBMS & Distributed FS • metadata on RDBMS, data chunks on DFS • Amazon RDS (PostgreSQL) + Amazon S3 / Riak CS • High throughput & high availability by S3 • Columnar format based on MessagePack • time-based chunking for time series data
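A toy illustration of the last two bullets, a columnar layout based on MessagePack plus time-based chunking, using the msgpack gem; this is not PlazmaDB's real on-disk format, only the shape of the idea:

```ruby
require 'msgpack'

# Group records into hourly chunks by their time field, then store each chunk
# column by column as MessagePack-packed arrays, so a query can read only the
# columns (and the time range) it actually needs.
def chunk_and_pack(records)
  records.group_by { |r| r[:time] - (r[:time] % 3600) }.map do |chunk_start, rows|
    columns = {}
    rows.first.keys.each { |col| columns[col] = rows.map { |r| r[col] }.to_msgpack }
    [chunk_start, columns]   # e.g. store each packed column on S3 under the chunk's key
  end
end

events = [
  { time: 1_445_040_000, user: 'a', value: 1 },
  { time: 1_445_040_010, user: 'b', value: 2 },
]
p chunk_and_pack(events)
```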
  37. Monitoring • Using Datadog for internal operations • Monitoring for our customers is also required: • How many records are they importing? • How many jobs are they executing? • How many threads/processes is a job consuming?
  38. PerfectMonitor
  39. PerfectMonitor • Still under construction :P • Fluentd-based metrics collection • Detailed metrics for real time, summarized metrics for the past • Real-time metric storage using InfluxDB • Historical metric storage using Treasure Data • Real-time data series are disposable :D • Potential next OSS product from Treasure Data
  40. For Further Improvement • More performance for more customers • Dynamic scaling for better performance and less cost • New analytics features for a brand-new experience
  41. "Done is better than Perfect."
  42. DoneQueue?
  43. We'll improve our code step by step, along with improvements in Ruby and its developer community <3 Thanks!
