
Fluentd meetup #3




  • 1. Collecting app metrics in decentralized systems: decision making based on facts.
    Sadayuki Furuhashi, Treasure Data, Inc., Founder & Software Architect. Fluentd meetup #3
  • 2. Self-introduction
    > Sadayuki Furuhashi
    > Treasure Data, Inc. Founder & Software Architect
    > Open source projects:
      MessagePack - efficient serializer (original author)
      Fluentd - event collector (original author)
  • 3. My Talk
    What's our service?
    What problems did we face?
    How did we solve them?
    What did we learn?
    We open sourced the system.
  • 4. What's Treasure Data?
    Treasure Data provides a cloud-based data warehouse as a service.
  • 5. Treasure Data Service Architecture (td-agent is open sourced)
    Apps (Apache, Rails apps, RDBMS, other data sources) send data through
    td-agent into the columnar data warehouse. MapReduce jobs (Hive; Pig to
    be supported) run on the query processing cluster, which users reach via
    td-command, the query APIs (JDBC, REST), and BI apps.
  • 6. Example Use Case - MySQL to TD (before)
    Hundreds of app servers: each Rails app writes logs to text files, a
    nightly batch INSERTs them into MySQL, and daily/hourly batches export
    to Google Spreadsheet for KPI visualization, feedback, and rankings.
    - Limited scalability
    - Fixed schema
    - Not realtime
    - Unexpected INSERT latency
  • 7. Example Use Case - MySQL to TD (after)
    Hundreds of app servers: each Rails app runs td-agent, which sends event
    logs to Treasure Data. Logs are available for queries after several
    minutes. Daily/hourly batches feed MySQL and Google Spreadsheet for KPI
    visualization, feedback, and rankings.
    - Unlimited scalability
    - Flexible schema
    - Realtime
    - Less performance impact
  • 8. What's Treasure Data?
    Key differentiators:
    > TD delivers Big Data analytics
    > in days, not months
    > without specialists or IT resources
    > for 1/10th the cost of the alternatives
    Why? Because it's a multi-tenant service.
  • 9. Problem 1: investigating problems took time
    Customers need support:
    > "I uploaded data but can't see it in queries"
    > "Downloading query results takes time"
    > "Our queries have been taking longer recently"
  • 10. Problem 1: investigating problems took time
    Investigating these problems took time because:
      doubts.count.times {
        servers.count.times {
          ssh to a server
          grep logs
        }
      }
  • 11. Problem 1: the actual facts
    > Data was actually not uploaded (the client had a problem: disk full).
      We ought to have monitored uploads so we'd know immediately when we
      stop receiving data from a user.
    > Our servers were getting slower because of increasing load.
      We ought to have noticed and added servers before it became a problem.
    > There was a bug that occurred only under a specific condition.
      We ought to have collected unexpected errors and fixed them as soon as
      possible, saving time for both us and our users.
  • 12. Problem 2: many tasks to do, but hard to prioritize
    We want to:
    > fix bugs
    > improve performance
    > increase the number of sign-ups
    > increase the number of queries by customers
    > increase the number of periodic queries
    What's the "bottleneck" that should be solved first?
  • 13. Problem 2: many tasks to do, but hard to prioritize
    We need data to make decisions.
      data: Performance is getting worse.
      decision: Let's add servers.
      data: Many customers upload data, but few issue queries.
      decision: Let's improve the documentation.
      data: A customer stopped uploading data.
      decision: They might have a problem on the client side.
  • 14. How did we solve it? We collected application metrics.
  • 15. Treasure Data's backend architecture: Frontend, Worker, Job Queue, and two Hadoop clusters.
  • 16. Solution v1: Fluentd pulls metrics from each component (Frontend, Worker, Job Queue, Hadoop) every minute using the in_exec plugin, then forwards them to Treasure Data for historical analysis and to Librato Metrics for realtime analysis.
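The v1 pull setup can be sketched as a Fluentd configuration. This is a hypothetical fragment, not the actual Treasure Data config: the command, script path, tag, and key names are illustrative, and the librato_metrics output plugin name is assumed; in_exec's format/keys/tag/run_interval options and the copy output are standard Fluentd features.

```
# Hypothetical v1 sketch: pull metrics with in_exec every minute,
# then fan out to Treasure Data (historical) and Librato (realtime).
<source>
  type exec                          # in_exec plugin
  command ruby /opt/td/print_jobqueue_metrics.rb   # illustrative script
  format tsv
  keys metric_name,value
  tag metrics.jobqueue
  run_interval 1m
</source>

<match metrics.**>
  type copy
  <store>
    type tdlog                       # to Treasure Data
  </store>
  <store>
    type librato_metrics             # to Librato Metrics (plugin name assumed)
  </store>
</match>
```

One such `<source>` block is declared per metric, which is exactly the configuration sprawl slide 18 complains about.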
  • 17. What's solved
    We can monitor the overall behavior of the servers.
    We can notice performance degradation.
    We get alerts when a problem occurs.
  • 18. What's not solved
    We can't get detailed information.
    > how much data is "this user" uploading?
    The configuration file is complicated.
    > we need to add lines to declare each new metric
    The monitoring server is a SPOF.
  • 19. Solution v2: Applications push metrics through a local Fluentd; Fluentd sums up the data every minute (partial aggregation) and forwards it to Treasure Data for historical analysis and to Librato Metrics for realtime analysis.
  • 20. What's solved by v2
    We can get detailed information directly from applications
    > graphs for each customer
    DRY - we can keep configuration files simple
    > just add one line to apps
    > no need to update fluentd.conf
    Decentralized streaming aggregation
    > partial aggregation on Fluentd, total aggregation on Librato Metrics
  • 21. API
    MetricSense.value({:size => 32})
    MetricSense.segment({:account => 1})
    MetricSense.fact({:path => '/path1'})
    MetricSense.measure!
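The semantics of the four calls above can be sketched in plain Ruby. This is a hypothetical, in-memory stand-in, not the real metricsense.gem (which posts records to the local Fluentd): it only shows how value/segment/fact build up one record that measure! finalizes.

```ruby
# Hypothetical sketch of the MetricSense client API shown on the slide.
# The real gem sends each finished record to a local Fluentd; here we
# accumulate records in memory so the buffering semantics are visible.
module MetricSense
  @pending = {}        # fields for the record currently being built
  @buffer  = []        # finished records (real gem: posted to Fluentd)

  class << self
    attr_reader :buffer

    def value(fields)   # measured values, e.g. {:size => 32}
      @pending.update(fields)
    end

    def segment(fields) # dimensions to aggregate by, e.g. {:account => 1}
      @pending.update(fields)
    end

    def fact(fields)    # non-aggregated detail, e.g. {:path => '/path1'}
      @pending.update(fields)
    end

    def measure!        # finish the record and start a fresh one
      @buffer << @pending
      @pending = {}
    end
  end
end

MetricSense.value({:size => 32})
MetricSense.segment({:account => 1})
MetricSense.fact({:path => '/path1'})
MetricSense.measure!
```

The one-line-per-app DRY property from slide 20 follows from this shape: apps only call the four methods, and all routing lives in the Fluentd side.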
  • 22. What did we learn?
    > We always have lots of tasks
      > we need data to prioritize them.
    > Problems are usually complicated
      > we need data to save time.
    > Adding metrics should be DRY
      > otherwise it feels tedious and you stop adding metrics.
    > Realtime analysis is useful, but we still need batch analysis.
      > "who is not issuing queries, despite storing data last month?"
      > "which pages did users look at before signing up?"
      > "which pages did users not look at before running into trouble?"
  • 23. We open sourced MetricSense
  • 24. Components of MetricSense
    metricsense.gem
    > client library for Ruby to send metrics
    fluent-plugin-metricsense
    > plugin for Fluentd to collect metrics
    > pluggable backends:
      > Librato Metrics backend
      > RDBMS backend
  • 25. RDB backend for MetricSense
    Aggregates metrics on an RDBMS in a form optimized for time-series data.
    > Borrowed concepts from OpenTSDB and the OLAP cube.

    metric_tags:
      metric_id  metric_name    segment_name
      1          "import.size"  NULL
      2          "import.size"  "account"

    segment_values:
      segment_id  name
      5           "a001"
      6           "a002"

    data:
      base_time  metric_id  segment_id  m0  m1  m2  ...  m59
      19:00      1          5           25  31  19  ...  21
      21:00      2          5           75  94  68  ...  72
      21:00      2          6           63  82  55  ...  63
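The key trick in the data table above is the row layout: one row per (metric, segment, hour) with sixty columns m0..m59, one per minute, as in OpenTSDB. A minimal Ruby sketch of that mapping, using a Hash in place of the RDBMS (the store function and its aggregation-by-sum are assumptions for illustration):

```ruby
# Hypothetical sketch of the time-series row layout: a sample at time t
# lands in the row keyed by t's hour and in the minute column m0..m59.
# Packing an hour of points into one row cuts the row count by 60x.
def store(rows, metric_id, segment_id, time, value)
  epoch     = time.to_i
  base_time = epoch - epoch % 3600        # truncate to the hour (row key)
  minute    = (epoch % 3600) / 60         # column index, 0..59
  key = [base_time, metric_id, segment_id]
  row = (rows[key] ||= Array.new(60))     # m0..m59, initially NULL
  row[minute] = (row[minute] || 0) + value  # aggregate within the minute
  rows
end

rows = {}
t = Time.utc(2012, 2, 4, 19, 23)          # 19:23 lands in column m23
store(rows, 1, 5, t, 25)
store(rows, 1, 5, t, 6)                   # same minute, so summed into m23
hour = Time.utc(2012, 2, 4, 19, 0).to_i
rows[[hour, 1, 5]][23]                    # => 31
```

In the real backend the same idea becomes an UPDATE of one column in one row per minute, rather than one INSERT per sample.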
  • 26. Solution v3 (future work): alerting using historical data
    > simple machine learning to adjust threshold values
      (compare against the historical average; deviation triggers an alert)
  • 27. We’re Hiring!
  • 28. Sales Engineer
    Evangelize TD/Fluentd. Get everyone excited! Help customers deploy and
    maintain TD successfully.
    Preferred experience: OS, DB, BI, statistics and data science
    Devops Engineer
    Development, operation and monitoring of our large-scale, multi-tenant
    system
    Preferred experience: large-scale system development and management
  • 29. Competitive salary + equity package
    Who we want:
    - STRONG business and customer support DNA
      Everyone is equally responsible for customer support
      Customer success = our success
    - Self-disciplined and responsible
      Be your own manager
    - Team player with excellent communication skills
      Distributed team and global customer base
    Contact me:
  • 30. contact: