Monitoring a Cloud CI System, with Jenkins as an Example / Alexander Akbashev (HERE Technologies)

HighLoad++ 2017

Hall "Delhi + Kolkata", November 8, 16:00

Abstract:
http://www.highload.ru/2017/abstracts/2503.html

The talk presents our experience of building a monitoring system for a large CI system that includes 4 Jenkins masters, the largest of which handles more than 100 thousand builds/runs per day. Because every commit in our company must pass through CI, the role of CI monitoring is enormous.
...

Published in: Engineering


  1. Monitoring a Cloud CI System, with Jenkins as an Example. Alexander Akbashev, HERE Technologies
  2. HERE Technologies
      HERE Technologies, the Open Location Platform company, enables people, enterprises and cities to harness the power of location. By making sense of the world through the lens of location we empower our customers to achieve better outcomes, from helping a city manage its infrastructure or an enterprise optimize its assets to guiding drivers to their destination safely. To learn more about HERE, including our new generation of cloud-based location platform services, visit http://360.here.com and www.here.com
  3. Context
      • Every change goes through pre-submit validation
      • Feedback time is 15-40 minutes
      • A lot of products and platforms
      • 6 Jenkins masters
      • Up to 185k runs per day on the biggest one
      • 20k runs per day on average
  4. If something goes wrong…
  7. What can go wrong?
      • Compilation is broken
      • Tests are broken
      • Network issues
      • Jenkins master crashed
      • EC2 plugin does not raise new nodes
      • No connection to labs
      • Cannot clean up workspace
      • AWS S3 is down
      • Git master dies
      • Git replica is broken
      • Compiler cache was invalidated
      • Hit the limit of API calls to AWS
      • Job was deleted
      • UI is blocked
      • Queue is too big
      • System.exit(1)
      • NFS stuck
      • Deadlock in Jenkins
      • Staging started to give feedback
      • Restarted the wrong server
  8. Monitoring Jenkins: out of the box
  9. Monitoring Jenkins (© http://www.jenkinselectric.com/monitoring)
  10. Monitoring Jenkins: https://jenkins.io/doc/book/system-administration/monitoring/
  11. Monitoring Jenkins: https://wiki.jenkins.io/display/JENKINS/Monitoring
  18. Monitoring Plugin (March 2016)
      + Easy to install
      + Nothing to maintain
      - When Jenkins is slow, there is no monitoring
      - Monitors mainly JVM stats
      - Only one instance
      - Not scalable
  19. Monitoring Plugin (nowadays)
      + Easy to install
      + Nothing to maintain
      - When Jenkins is slow, there is no monitoring
      - Monitors mainly JVM stats
      - Only one instance
      - Not scalable
      + InfluxDB/CloudWatch/Graphite
  20. Let's build our own monitoring!
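  24. Designing our own monitoring (March 2016): Jenkins <-API-> Python <-API-> InfluxDB
      import influxdb
      import jenkins

      # Host names, database and measurement name are placeholders.
      j = jenkins.Jenkins("http://jenkins.host")
      influx_server = influxdb.InfluxDBClient(host="influxdb.host", port=8086, database="stats")

      # One point per queued item: which job is waiting and why.
      for q in j.get_queue_info():
          influx_server.write_points([{
              "measurement": "queue",
              "tags": {"name": q["task"]["name"]},
              "fields": {"reason": q["why"]},
          }])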
  32. Designing our own monitoring (March 2016): Jenkins <-API-> Python <-API-> InfluxDB
      + Simple
      + Worked for 18 months
      - Polling
      - Common code has to be maintained
      - Not all data is accessible
      - Extra load
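The polling design above can be sketched without any network access by separating the transformation step from the API calls. The following is a minimal illustration, not the speaker's script: the snapshot shape (a `task` dict with a `name`, plus a `why` string) is an assumption about what a Jenkins queue query returns, and the measurement name `queue`, the `unknown` fallback and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def queue_to_points(queue_info: List[Dict[str, Any]], timestamp: int) -> List[Dict[str, Any]]:
    """Turn one polled queue snapshot into InfluxDB-style point dicts."""
    points = []
    for item in queue_info:
        points.append({
            "measurement": "queue",  # hypothetical measurement name
            "time": timestamp,
            "tags": {"name": item["task"]["name"]},
            # A missing or empty reason is stored as "unknown" (our choice).
            "fields": {"reason": item.get("why") or "unknown"},
        })
    return points


def reasons_summary(points: List[Dict[str, Any]]) -> Counter:
    """Count how many queued items share each waiting reason."""
    return Counter(p["fields"]["reason"] for p in points)


# A hand-written snapshot standing in for one polling cycle (made-up data).
sample_snapshot = [
    {"task": {"name": "build-app"}, "why": "Waiting for next available executor"},
    {"task": {"name": "run-tests"}, "why": "Waiting for next available executor"},
    {"task": {"name": "deploy"}, "why": None},
]
points = queue_to_points(sample_snapshot, timestamp=1510156800)  # arbitrary Unix time
summary = reasons_summary(points)
```

Each polling cycle only sees the queue as it is at that instant, which is one way to read the "polling" and "not all data is accessible" drawbacks: anything that enters and leaves the queue between two cycles never appears in a snapshot.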
  33. Let's do event-based monitoring!
  35. Jenkins Core
      public abstract class RunListener<R extends Run> implements ExtensionPoint {
          public void onCompleted(R r, TaskListener listener) {}
          public void onFinalized(R r) {}
          public void onStarted(R r, TaskListener listener) {}
          public void onDeleted(R r) {}
      }
  36. Groovy Event Listener Plugin (April 2016)
      • Allows executing custom Groovy code for every event
      • Supports RunListener
  37. Groovy Event Listener Plugin (nowadays)
      • Allows executing custom Groovy code for every event
      • Supports RunListener, ComputerListener, ItemListener, QueueListener
      • Works at scale
      • Allows a custom classpath
  38. Groovy Event Listener Plugin
      // TimeInQueueAction comes from the Metrics plugin.
      if (event == 'RunListener.onFinalized') {
          def build = Thread.currentThread().executable
          def queueAction = build.getAction(TimeInQueueAction.class)
          def queuing = queueAction.getQueuingDurationMillis()
          log.info "number=$build.number, queue_duration=$queuing"
      }
  39. OK, we have events, but how do we fill the DB?
  47. FluentD
      • Processes 13,000 events/second/core
      • Retry/buffer/routing
      • Easy to extend
      • Simple
      • Reliable
      • Memory footprint is 30-40 MB
      • Ruby
  50. FluentD
      Jenkins -JSON-> FluentD -JSON-> InfluxDB
                      FluentD -SQL-> Postgres
                      FluentD -> Logs
  55. FluentD. Config.
      <match **.influx.**>
        type influxdb
        host influxdb.host
        port 8086
        dbname stats
        auto_tags "true"
        timestamp_tag timestamp
        time_precision s
      </match>
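The `<match **.influx.**>` pattern in the config above routes events by tag. As documented for Fluentd, `*` matches exactly one dot-separated tag part and `**` matches zero or more parts. The sketch below is a simplified Python model of those two wildcards only, written to show which tags such a pattern would select; it is not Fluentd's implementation, it ignores other pattern features such as `{a,b}` alternatives, and the function names are made up for this example.

```python
from typing import List


def _match_parts(pattern: List[str], tag: List[str]) -> bool:
    """Match a list of pattern parts against a list of tag parts."""
    if not pattern:
        return not tag
    head, rest = pattern[0], pattern[1:]
    if head == "**":
        # "**" may absorb zero or more tag parts.
        return any(_match_parts(rest, tag[i:]) for i in range(len(tag) + 1))
    if not tag:
        return False
    if head == "*" or head == tag[0]:
        # "*" absorbs exactly one part; a literal must be equal.
        return _match_parts(rest, tag[1:])
    return False


def tag_matches(pattern: str, tag: str) -> bool:
    """Return True if a dot-separated tag matches the pattern."""
    return _match_parts(pattern.split("."), tag.split("."))


# Tags that the pattern from the slide would or would not select.
matches_bfa = tag_matches("**.influx.**", "influx.bfa")
matches_prefixed = tag_matches("**.influx.**", "jenkins.influx.bfa")
matches_other = tag_matches("**.influx.**", "postgres.bfa")
```

Under this model, any event whose tag contains an `influx` part goes to the InfluxDB output, which is why the same FluentD instance can send other tags to different destinations.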
  56. OK, we have events and we have FluentD, but how do we pass events to it?
  61. FluentD Plugin for Jenkins
      • Developed at HERE Technologies
      • Very simple
      • Supports JSON
      • Post-build step
  62. FluentD Plugin for Jenkins: https://github.com/jenkinsci/fluentd-plugin
  63. Great! Let's do something with this data!
  64. Infra issues
  65. Build Failure Analyzer (config)
  71. Build Failure Analyzer (code)
      // jobName and node are defined elsewhere in the listener script (not shown on the slide).
      def bfa = build.getAction(FailureCauseBuildAction.class)
      def causes = bfa.getFailureCauseDisplayData().getFoundFailureCauses()
      for (def cause : causes) {
          final Map<String, Object> data = new HashMap<>()
          data.put("name", jobName)
          data.put("number", build.number)
          data.put("cause", cause.getName())
          data.put("categories", cause.getCategories().join(','))
          data.put("timestamp", build.timestamp.timeInMillis)
          data.put("node", node)
          // The "influx.bfa" tag routes the record to the InfluxDB output.
          context.logger.log("influx.bfa", data)
      }
  72. Build Failure Analyzer (result)
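Once records like the ones built in the Groovy snippet above are stored, a dashboard typically groups them. As a rough illustration of that grouping step (not the speaker's dashboard query), the Python sketch below counts failures per category, splitting the comma-joined `categories` field that the slide code produces with `join(',')`. The sample records, job names and category names are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_category(records: Iterable[Dict[str, object]]) -> Counter:
    """Count failure records per category; one record may carry several."""
    counts: Counter = Counter()
    for record in records:
        # "categories" is a comma-joined string, e.g. "infra,jenkins".
        for category in str(record.get("categories", "")).split(","):
            category = category.strip()
            if category:
                counts[category] += 1
    return counts


# Made-up records in the same shape as the map sent to "influx.bfa".
sample_records = [
    {"name": "build-app", "number": 101, "cause": "Jar Hell", "categories": "infra,jenkins"},
    {"name": "run-tests", "number": 57, "cause": "NFS stuck", "categories": "infra"},
    {"name": "build-app", "number": 102, "cause": "Compilation error", "categories": "code"},
]
category_counts = count_by_category(sample_records)
```

Splitting by category is what lets infrastructure failures be separated from failures caused by the change under test.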
  73. Speed up compilation
  74. CCache (problem)
  79. CCache
      • New node: empty local cache
      • Old local cache: a lot of misses
      + A distributed cache solves all these problems
      - Once a year it distributes a problem across the whole cluster
  80. CCache (result)
  81. Improve node utilization
  82. LoadBalancer (problem)
  87. LoadBalancer (solution)
      • Default balancer is optimized for cache
      • Cron jobs are pinned to different hosts
      • Nothing to terminate/stop: no idle nodes
      + Saturate Node Load Balancer: always put all load on the oldest node
  88. LoadBalancer (result)
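The "always put all load on the oldest node" rule can be modelled in a few lines. The Python sketch below is a toy model of that rule only, not the Saturate Node Load Balancer plugin's code: the `Node` class, its fields and the sample cluster are hypothetical, and it assumes that a node with a free executor can take any job.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    launched_at: int  # Unix time when the node was started
    executors: int    # total executor slots
    busy: int = 0     # slots currently in use

    def has_free_executor(self) -> bool:
        return self.busy < self.executors


def pick_node(nodes: List[Node]) -> Optional[Node]:
    """Pick the oldest node that still has a free executor."""
    candidates = [n for n in nodes if n.has_free_executor()]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.launched_at)


def assign(nodes: List[Node], jobs: List[str]) -> Dict[str, str]:
    """Place jobs one by one, saturating older nodes first."""
    placement: Dict[str, str] = {}
    for job in jobs:
        node = pick_node(nodes)
        if node is None:
            break  # cluster is full; remaining jobs stay queued
        node.busy += 1
        placement[job] = node.name
    return placement


# Three nodes with two executors each and three jobs (made-up data).
demo_nodes = [Node("old", 100, 2), Node("mid", 200, 2), Node("new", 300, 2)]
demo_placement = assign(demo_nodes, ["a", "b", "c"])
```

In this toy run the newest node receives nothing, so it stays idle. That is presumably the point of the slide: packing load onto old nodes leaves idle nodes that can be stopped, whereas spreading load keeps every node slightly busy.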
  89. Minimize impact
  90. Jar Hell (problem)
      java.io.InvalidClassException: hudson.util.StreamTaskListener; local class incompatible: stream classdesc serialVersionUID = 1, local class serialVersionUID = 294073340889094580
  95. Jar Hell (explanation)
      • Bug in the Jenkins Remoting layer
      • If the first run that uses some class is aborted, this class is "lost"
      • Does not recover
      • Huge impact
  96. Jar Hell ("solution")
      // Take the affected node out of rotation by replacing its labels.
      if (cause.getName().equals("Jar Hell")) {
          Node node = build.getBuiltOn()
          if (node != Jenkins.getInstance()) {
              node.setLabelString("disabled_jar_hell")
          }
      }
  97. Our daily dashboard
  99. Resources
      • FluentD
      • InfluxDB plugin for FluentD
      • JavaGC plugin for FluentD
      • FluentD Plugin
      • Groovy Event Listener Plugin
      • Build Failure Analyzer Plugin
      • Saturate Node Load Balancer Plugin
      • CCache with memcache
      • InfluxDB
  100. Q/A? alexander.akbashev@here.com, GitHub: Jimilian
