Collecting app metrics
in decentralized systems
Decision making based on facts

Sadayuki Furuhashi
Treasure Data, Inc.
Founder & Software Architect

Fluentd meetup #3
Self-introduction

>   Sadayuki Furuhashi
>   Treasure Data, Inc.
    Founder & Software Architect

>   Open source projects
    MessagePack - efficient serializer (original author)
    Fluentd - event collector (original author)
My Talk

> What’s our service?
> What are the problems we faced?
> How did we solve them?
> What did we learn?
> We open sourced the system
What’s Treasure Data?

Treasure Data provides a cloud-based data warehouse
as a service.
Treasure Data Service Architecture

[Diagram] Apps, Apache servers, RDBMSs, and other data sources feed logs
into td-agent (open sourced), which streams them into the Treasure Data
columnar data warehouse. MapReduce jobs (Hive; Pig to be supported) run
against the warehouse, and users and BI apps query it through the query
processing cluster’s API (JDBC, REST) or td-command.
Example Use Case – MySQL to TD (before)

[Diagram] Hundreds of app servers: each Rails app writes logs to text
files, and a nightly batch INSERTs them into MySQL. Daily/hourly batch
jobs then feed Google Spreadsheet and MySQL for KPI visualization and
feedback rankings.

- Limited scalability
- Fixed schema
- Not realtime
- Unexpected INSERT latency
Example Use Case – MySQL to TD (after)

[Diagram] Hundreds of app servers: each Rails app sends event logs through
a local td-agent to Treasure Data, where logs become available after
several minutes. Daily/hourly batch jobs feed Google Spreadsheet and MySQL
for KPI visualization and feedback rankings.

  Unlimited scalability
  Flexible schema
  Realtime
  Less performance impact

(A client-side sketch of “sends event logs” follows below.)
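The slides show no client code, but a minimal sketch with the fluent-logger
gem looks roughly like this; the tag prefix td.mydb and table name access
are made-up examples, and the local td-agent is assumed to listen on the
default forward port:

    require 'fluent-logger'   # gem install fluent-logger

    # Connect to the td-agent running on the same host.
    log = Fluent::Logger::FluentLogger.new('td.mydb',
                                           :host => 'localhost', :port => 24224)

    # Emit one event log; td-agent buffers it and uploads it in bulk.
    # The final tag is 'td.mydb.access' (database 'mydb', table 'access').
    log.post('access', :path => '/page1', :uid => 42)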
What’s Treasure Data?

Key differentiators:
>   TD delivers BigData analytics
>   in days, not months
>   without specialists or IT resources
>   for 1/10th the cost of the alternatives
Why? Because it’s a multi-tenant service.
Problem 1:
investigating problems took time


Customers need support...
 >   “I uploaded data but it doesn’t show up in queries”
 >   “Downloading query results takes time”
 >   “Our queries have been taking longer recently”
Problem 1:
investigating problems took time

Investigating these problems took time
because:

        # manual investigation: O(doubts × servers), all by hand
        doubts.count.times {
          servers.count.times {
            # ssh to a server
            # grep its logs
          }
        }
* The actual facts
>   Data were actually not uploaded
    (the clients had a problem: disk full)
     We ought to have monitored uploading so that we’d immediately know
     we’re not getting data from a user.

>   Our servers were getting slower because of increasing
    load
     We ought to have noticed and added servers before the problem hit.

>   There was a bug that occurred only under a specific
    condition
     We ought to have collected unexpected errors and fixed them as soon
     as possible, saving time for both us and our users.
Problem 2:
many tasks to do but hard to prioritize
We want to do...
 > fix bugs
 > improve performance
 > increase the number of sign-ups
 > increase the number of queries by customers
 > increase the number of periodic queries

What’s the “bottleneck” that should be
solved first?
Problem 2:
many tasks to do but hard to prioritize

We need data to make decisions.
 data: Performance is getting worse.
 decision: Let’s add servers.

 data: Many customers upload data but few customers issue queries.
 decision: Let’s improve the documentation.

 data: A customer stopped uploading data.
 decision: They might have a problem on the client side.
How did we solve them?


We collected application metrics.
Treasure Data’s backend architecture

[Diagram] Frontend → Job Queue → Worker → two Hadoop clusters
Solution v1:

[Diagram] A single Fluentd pulls metrics from the frontend, job queue,
workers, and Hadoop clusters every minute (in_exec plugin), then forwards
them to Treasure Data for historical analysis and to Librato Metrics for
realtime analysis. (A config sketch follows below.)
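A minimal fluentd.conf sketch of the pull setup, assuming in_exec and a
hypothetical script that prints one “name<TAB>value” line per metric; the
command, keys, and tag are illustrative, not the production config:

    <source>
      type exec
      command ruby /opt/td/print_metrics.rb
      format tsv
      keys name,value
      tag metrics.pull
      run_interval 1m
    </source>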
What’s solved



We can monitor the overall behavior of our servers.

We can notice performance degradation.
We can get alerts when a problem occurs.
What’s not solved


We can’t get detailed information.
 > how much data is “this user” uploading?


The configuration file is complicated.
 > we need to add lines to declare each new metric


The monitoring server is a SPOF.
Solution v2:

[Diagram] Applications on the frontend, job queue, workers, and Hadoop
clusters push metrics to a local Fluentd, which forwards them to an
aggregator Fluentd that sums up the data every minute (partial
aggregation) and sends the results to Treasure Data for historical
analysis and to Librato Metrics for realtime analysis.
What’s solved by v2
We can get detailed information directly from
applications
 > graphs for each customer

DRY - we can keep configuration files simple
 > just add one line to the apps
 > no need to update fluentd.conf

Decentralized streaming aggregation
 > partial aggregation on Fluentd,
   total aggregation on Librato Metrics
   (a toy sketch follows below)
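A toy illustration of that split (my sketch, not the plugin’s code): each
node keeps partial sums per (metric, segment) for the current minute, and
the upstream tier merges partial sums from many nodes into totals.

    # Partial aggregation on one node: sum values per (metric, segment).
    partials = Hash.new(0)
    [['import.size', 'a001', 32],
     ['import.size', 'a001', 10]].each do |metric, seg, value|
      partials[[metric, seg]] += value
    end
    p partials   # => {["import.size", "a001"]=>42}

    # Total aggregation upstream: merge partial sums from many nodes.
    total = Hash.new(0)
    [partials, { ['import.size', 'a001'] => 7 }].each do |node|
      node.each { |key, sum| total[key] += sum }
    end
    p total      # => {["import.size", "a001"]=>49}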
API


MetricSense.value(:size => 32)        # the measured value
MetricSense.segment(:account => 1)    # dimension to aggregate by
MetricSense.fact(:path => '/path1')   # extra attribute attached to the event
MetricSense.measure!                  # emit the metric (via the local Fluentd)
What did we learn?
>   We always have lots of tasks
    > we need data to prioritize them.

>   Problems are usually complicated
    > we need data to save time.

>   Adding metrics should be DRY
    > otherwise it feels like a chore and you stop adding metrics.

>   Realtime analysis is useful,
    but we still need batch analysis.
    >   “who stored data last month but isn’t issuing queries?”
    >   “which pages did users look at before signing up?”
    >   “which pages didn’t users look at before running into trouble?”
We open sourced



     MetricSense
      https://github.com/treasure-data/metricsense
Components of MetricSense

metricsense.gem
 > client library for Ruby to send metrics

fluent-plugin-metricsense
 > plugin for Fluentd to collect metrics
 > pluggable backends:
   > Librato Metrics backend
   > RDBMS backend
RDB backend for MetricSense

Aggregates metrics in an RDBMS, in a form optimized
for time-series data.
 > borrows concepts from OpenTSDB and OLAP cubes
metric_tags:
 metric_id, metric_name,   segment_name
         1  “import.size”          NULL
         2  “import.size”     “account”

segment_values:
 segment_id, name
          5  “a001”
          6  “a002”

data:
 base_time, metric_id, segment_id,   m0,   m1,   m2,   ...,   m59
     19:00          1           5    25    31    19    ...     21
     21:00          2           5    75    94    68    ...     72
     21:00          2           6    63    82    55    ...     63
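Each data row thus packs one hour of per-minute values. A sketch of how a
timestamp maps to a row and column under this layout (my illustration of
the scheme, not the gem’s code):

    require 'time'

    # Map a timestamp to its hourly row (base_time) and minute column.
    def row_position(t)
      epoch = t.to_i
      base_time = Time.at(epoch - epoch % 3600)   # truncate to the hour
      column    = "m#{(epoch % 3600) / 60}"       # one column per minute: m0..m59
      [base_time, column]
    end

    p row_position(Time.parse('2012-11-08 21:17:00'))
    # => [2012-11-08 21:00:00 (local time), "m17"]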
Solution v3 (future work):

Alerting using historical data
 > simple machine learning to adjust threshold
   values

[Diagram] A metric climbing above its historical average triggers an
alert. (A toy sketch follows below.)
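A toy sketch of the idea (my illustration; the talk leaves this as future
work): derive the threshold from the historical mean and standard
deviation instead of hard-coding it.

    # Alert when the latest value is more than k standard deviations
    # above the historical average; the threshold adapts with the data.
    def alert?(history, latest, k = 3.0)
      mean = history.sum.to_f / history.size
      var  = history.map { |v| (v - mean) ** 2 }.sum / history.size
      latest > mean + k * Math.sqrt(var)
    end

    p alert?([20, 22, 19, 21, 20], 45)   # => true
    p alert?([20, 22, 19, 21, 20], 22)   # => false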
We’re Hiring!
Sales Engineer
  Evangelize TD/Fluentd. Get everyone excited!
  Help customers deploy and maintain TD successfully.
  Preferred experience: OS, DB, BI, statistics and data
  science

DevOps engineer
  Development, operation, and monitoring of our large-scale,
  multi-tenant system
  Preferred experience: large-scale system development
  and management
Competitive salary + equity package
Who we want
  STRONG business and customer support DNA
     Everyone is equally responsible for customer support
     Customer success = our success
  Self-disciplined and responsible
     Be your own manager
  Team player with excellent communication skills
     Distributed team and global customer base

Contact me: sf@treasure-data.com
contact: sales@treasure-data.com
