Complex and simple ways to write to InfluxDB
Krystof Borkovec
IT-CM-IS
Structure of the talk
1) Use case
2) What I have tried
3) Current solution
1) Use case
• Grafana dashboard for HTCondor users
• Shows resource usage stats from cgroups
• CPU, memory, I/O
• Live, interval ~10s
• Simplify advanced job debugging and optimization
• Searchable by job ID (HTCondor global job ID)
• ~100K job slots (i.e. cgroups)
• InfluxDB schema:
• Ex. measurement name: cpu
• Ex. tags: global_job_id, host, slot, user
• Ex. values: avg_system, avg_user,…
• We don’t know yet whether InfluxDB can handle this load
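A minimal sketch of how one sample could map onto this schema in InfluxDB line protocol. The measurement, tag, and field names come from the slide; the helper names (`escape_tag`, `to_line`), the job ID, the slot name, and all values are hypothetical. Only tag values are escaped here, and the timestamp is assumed to be in seconds.

```python
def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    text = str(value)
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text


def to_line(measurement, tags, fields, timestamp_s):
    """Build one line-protocol record: measurement,tags fields timestamp."""
    tag_part = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={float(v)}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_s}"


# Hypothetical sample for one job slot.
line = to_line(
    "cpu",
    {
        "global_job_id": "sched01.example.org#1234.0#1500000000",
        "host": "myhost.cern.ch",
        "slot": "slot1_2",
        "user": "alice",
    },
    {"avg_system": 1.5, "avg_user": 42.0},
    1500000010,
)
print(line)
```

With roughly 100K slots and a new `global_job_id` value per job, the number of distinct tag combinations grows with every job, which is one reason the load question above stays open.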
2) What I have tried - technologies
• Others suggested using CAdvisor and collectd.
• There was a partial solution based on that which
didn’t allow for searching by Job ID.
• That means the following pipeline:
Cgroups > CAdvisor > cadvisor-collectd plugin > collectd >
collectd WriteGraphite plugin > InfluxDB Graphite input templates >
InfluxDB > Grafana
There is no official collectd plugin for InfluxDB, so we
pretend to send the data to Graphite.
2) What I have tried - Graphite
Cadvisor-collectd plugin has to emit data in collectd
format – i.e. each record identified by:
(host, plugin, plugin_instance, type, type_instance)
(myhost.cern.ch, cpu, 2, cpu, idle)
Collectd WriteGraphite plugin will use it to construct
records of the following form:
myhost_cern_ch.cpu-2.cpu-idle 98.6103 1329168255
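The naming shown above can be reproduced with a few lines of Python. This is only an emulation of the record format on the slide, not collectd's code: the function names are hypothetical, and the real WriteGraphite plugin has further options (prefix, postfix, escape character) that are ignored here.

```python
def graphite_name(host, plugin, plugin_instance, type_, type_instance):
    """Turn a collectd identifier into a dotted Graphite metric name."""

    def join(name, instance):
        # The instance part is optional in a collectd identifier.
        return f"{name}-{instance}" if instance else name

    return ".".join(
        [
            host.replace(".", "_"),  # dots in the host would split the path
            join(plugin, plugin_instance),
            join(type_, type_instance),
        ]
    )


def graphite_record(identifier, value, timestamp):
    """Format one Graphite plaintext record: name value timestamp."""
    return f"{graphite_name(*identifier)} {value} {timestamp}"


record = graphite_record(
    ("myhost.cern.ch", "cpu", "2", "cpu", "idle"), 98.6103, 1329168255
)
print(record)
```

Everything that should later become a tag has to be squeezed into these five identifier fields, which is where the schema problems described next come from.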
2) What I have tried - Graphite
• If you send a metric named
myhost_cern_ch.cpu-2.cpu-idle, InfluxDB will store the full
metric name as the measurement, with no tags extracted.
• To extract tags, you can define templates in InfluxDB
configuration:
• With template: measurement.cpu_number.cpu_stat
• InfluxDB will store the record in measurement named
myhost_cern_ch with tags cpu_number=cpu-2 and
cpu_stat=cpu-idle
• This pushes you to misuse the structure of the collectd record to get the
desired schema.
• Limited flexibility, fixed set of templates, no programmatic control.
• Templates are on the Influx server – difficult to debug and change!
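To make the template mechanism concrete, here is a simplified emulation of how a template is matched position by position against a dotted metric name. The function `apply_template` is hypothetical and is not InfluxDB code; real templates also support filters, wildcards, and default tags, none of which are modelled. Extra name components beyond the template length are silently dropped in this sketch.

```python
def apply_template(template, metric_name):
    """Split a Graphite metric name into (measurement, tags) using a template.

    Each dot-separated template key is paired with the name component at the
    same position. The key 'measurement' selects the measurement name; any
    other non-empty key becomes a tag.
    """
    measurement_parts = []
    tags = {}
    for key, part in zip(template.split("."), metric_name.split(".")):
        if key == "measurement":
            measurement_parts.append(part)
        elif key:  # an empty key means "skip this component"
            tags[key] = part
    return ".".join(measurement_parts), tags


measurement, tags = apply_template(
    "measurement.cpu_number.cpu_stat", "myhost_cern_ch.cpu-2.cpu-idle"
)
print(measurement, tags)
```

The sketch shows why the approach is rigid: tags can only come from positions in the metric name, so a value such as a job ID has to be forced into one of the collectd identifier fields first.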
2) What I have tried - enrichment
On top of the schema problems with WriteGraphite,
I had problems with enrichment, i.e. adding the Job ID
(taken from the output of the condor_who command) to the cgroups data:
• Could I add Job ID in cadvisor-collectd plugin?
  • Nope, it is HTCondor-specific, shouldn’t be there
• Could I add Job ID in collectd via multiple plugins?
  • Cadvisor-collectd > WriteCSV
  • Exec-plugin
    • Run condor_who command to get Job Slot – Job ID mapping
    • Read CSV file written by cadvisor-collectd plugin
    • Mix cgroups stats together with Job Slot – Job ID mapping
  • Nope, that is hackish, there is no elegant way to enrich data in collectd.
• Could I add Job ID in Flume Morphline?
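Conceptually, the enrichment discussed above is just a join between per-slot cgroup statistics and a Job Slot to Job ID mapping. The sketch below assumes both inputs have already been parsed into Python structures; the record layout, the slot names, the job IDs, and the function `enrich` are hypothetical, and no particular condor_who output format is assumed.

```python
def enrich(cgroup_stats, slot_to_job):
    """Attach a global_job_id to every stats record whose slot runs a job."""
    enriched = []
    for record in cgroup_stats:
        job_id = slot_to_job.get(record["slot"])
        if job_id is None:
            # Slot is idle or the mapping is stale: nothing to search by.
            continue
        enriched.append({**record, "global_job_id": job_id})
    return enriched


# Hypothetical inputs: stats read from cgroups, mapping derived from condor_who.
stats = [
    {"slot": "slot1_1", "avg_user": 42.0},
    {"slot": "slot1_2", "avg_user": 7.5},
]
mapping = {"slot1_1": "sched01.example.org#1234.0#1500000000"}

result = enrich(stats, mapping)
print(result)
```

In plain code this is a dictionary lookup; the difficulty described on this slide is that collectd offers no natural place to put such a lookup.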
2) What I have tried - Flume
Because:
• people were talking about using Flume
• I had problems with enriching data in collectd
• I had problems with getting the correct schema with Graphite plugin
I tried to add Flume into the pipeline to enrich the data and get
more flexibility with respect to the schema.
Cgroups > CAdvisor > cadvisor-collectd plugin > collectd >
collectd chains > collectd WriteHTTP plugin > Flume agent >
Flume morphlines > Flume Write HTTP Sink > InfluxDB > Grafana
2) What I have tried - Flume
So I tried to understand how these technologies fit together:
• Cgroups
• Cadvisor
• Collectd
• Cadvisor plugin
• WriteHTTP plugin
• Collectd chains
• Flume
• Flume morphlines
• HMRC Flume Write HTTP Sink
• InfluxDB
• HTTP API
• Grafana
• Related Puppet modules
3) Current solution - cgs
The resulting pipeline with Flume:
• adds another 900 LOC in Java, 3 threads
• works, but insanely complex
I realized that:
• I actually don’t need all those things (CAdvisor, collectd, Flume)
• I have spent a huge amount of time understanding and fixing stuff I don’t need.
• Things which should be easy are complex with collectd and Flume.
• Lxplus has the same problem with cgroups data enrichment.
So I wrote CGroups Simple (https://gitlab.cern.ch/batch-team/cgs):
• 1200 LOC in Python
• Reads cgroup files directly
• Writes directly to the InfluxDB HTTP API using the Python requests library
• Much simpler and easier to maintain
• Can be used by both batch and lxplus (and others)
• Generic, can write wherever, can be turned into collectd plugin if needed
• Ignacio Reguero contributed and extended it for lxplus accounting
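The direct approach can be illustrated with a short sketch. This is not the cgs code: the function names are hypothetical, the file format assumed is cgroup v1 `cpuacct.stat` (lines of the form `user N` and `system N`), and the write call assumes an InfluxDB 1.x style `/write` endpoint with a `db` parameter. The `requests` import is done lazily so that the parsing part runs without the third-party library installed.

```python
def parse_cpuacct_stat(text):
    """Parse cgroup v1 cpuacct.stat content into {'user': N, 'system': N}."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def build_line(host, slot, stats, timestamp_s):
    """Build one line-protocol record with integer fields (suffix 'i')."""
    fields = ",".join(f"{k}={v}i" for k, v in sorted(stats.items()))
    return f"cpu,host={host},slot={slot} {fields} {timestamp_s}"


def write_lines(lines, base_url, database):
    """POST line-protocol records to InfluxDB (assumed 1.x /write endpoint)."""
    import requests  # third-party; imported here so parsing works without it

    response = requests.post(
        base_url + "/write",
        params={"db": database, "precision": "s"},
        data="\n".join(lines).encode("utf-8"),
        timeout=10,
    )
    response.raise_for_status()


# Sample file content instead of a real cgroup path (paths are site-specific).
sample = "user 1200\nsystem 340\n"
line = build_line("myhost.cern.ch", "slot1_2", parse_cpuacct_stat(sample), 1500000010)
print(line)
# write_lines([line], "http://localhost:8086", "mydb")  # hypothetical server
```

Because reading, enriching, and writing live in one small program, adding a tag such as the job ID is an ordinary code change rather than a configuration problem spread over several tools.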
My personal takeaway
• I have wasted a lot of time trying to understand technologies I didn’t need.
• When simple things get insanely complex, step back and think about simplicity.
• Other people are likely to face the same problem – no simple way to enrich
data with collectd/Flume – we should have a common solution.
Thank you for your attention.
Questions & feedback welcome!