Building trust within the organization, first steps towards DevOps


Presented at the Berlin DevOps meetup


  1. Building trust within the organization, first steps towards DevOps
     Guido Serra, txtr
  2. What’s the role of a DevOp(s)?
     • Deliver
     • Be a bridge of trust between DEVs and SysOPs
     • Stop the “throw the ball over the fence” game
     • Mediate
     • Drive non-functional requirements
     … DevOp or DevOps, when talking of just one?
  3. Introduce a DevOp(s)
     • At ‘txtr, starting as a QA Manager specialised in backend systems seems to have worked
     • Other organizations tend to call the role Site Reliability Engineer / Site Reliability Operation
     • But… QA != Testing, not strictly at least
       – Testing should be only a subset of QA, but that is not how it is normally perceived
       – Non-functional requirements did not seem to fit in
  4. Non-functional requirements?
     • Functional requirements == features
     • Non-functional requirements == everything that OPS would need to run the service, or even things that Product Owners would want but have not thought of at design time
       – Logging
         • Which kind of information?
         • How?
       – Health checks / Load Balancer required URL
       – Live sales report / Dashboard / Charting
  5. Steps that worked so far
     • Listen… to OPS, to PMs, to QA, to R&D
     • See how people have solved their specific needs when trying to gather information
     • Match up all the tools that have been built
     • Try to extract the essence of those tools, and come up with non-functional requirements
     • Discuss those with the R&D organization and push them at Product level to be prioritized over features
  6. TRUST means…
     • Not having to duplicate work
       – wrongly testing the backend to see if it is answering
       – or testing to measure the response times
       – or creating tests again, when there are plenty of them that are simply not shared and/or broadly understood
  7. The answer is 42? …no, the answer is DATA!
     • By creating a single point of data collection and graphing, people are gaining trust in the backend
     • Logs need to be shared too
     • Tests need to be commonly understood
  9. Tools
     • Logging
       – Slf4j > Log4j / JUL > GELF > GrayLog2
         • Logging to syslog from a Java-based backend is pretty bad. The stacktrace becomes very hard to fetch and report in a ticket. Instead, one link and a screenshot, or a cut & paste of a complete stacktrace from a web interface, is much easier to digest
         • GELF is a notification format, encapsulating the full stacktrace as a single message
         • GrayLog2 is a ruby/MongoDB FIFO queue with a nice web interface and an email alerting system
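To make the "full stacktrace as a single message" point concrete, here is a minimal sketch of a GELF payload, written in Python for brevity rather than in the Java stack the slides describe. It only builds the JSON document and does not send it anywhere. The function names, the host name `backend-01` and the additional field `_service` are hypothetical and chosen for illustration.

```python
import json
import time
import traceback


def build_gelf_message(exc, host, level=3):
    """Return a GELF 1.1 JSON document that carries the whole stacktrace."""
    full_trace = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    payload = {
        "version": "1.1",
        "host": host,
        "short_message": f"{type(exc).__name__}: {exc}",
        "full_message": full_trace,  # the multi-line trace stays in one message
        "timestamp": time.time(),
        "level": level,              # syslog severity, 3 = error
        "_service": "backend-api",   # hypothetical additional field
    }
    return json.dumps(payload)


def make_sample_message():
    """Raise and catch an error so that there is a real traceback to encode."""
    try:
        raise ValueError("book id missing")
    except ValueError as exc:
        return build_gelf_message(exc, host="backend-01")
```

Because the trace travels inside `full_message`, the receiving side never has to stitch it back together from separate syslog lines.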
  10. Why?
      • Slf4j
        – It is an abstraction layer over logging facilities
          • I’ll not explain why an “abstraction layer” is good
      • Log4j or JUL, at your choice
        – They are the most commonly used
          • Means: their code is maintained
      • GELF
        – It keeps a full stacktrace in a single message. There is no need to reconstruct it from syslog, where it is spread over multiple lines and mixed with additional garbage/timestamps
      • GrayLog2
        – We have an in-house developer, and it is working pretty well
        – Has threshold-based alerting per stream of events (regexp)
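The idea of threshold-based alerting on a regexp-defined stream can be sketched as follows. This illustrates the concept only and is not how GrayLog2 implements it; the class name `StreamAlert` and its parameters are assumptions made for the example.

```python
import re
from collections import deque


class StreamAlert:
    """Fire when more than `threshold` matching messages arrive within a window."""

    def __init__(self, pattern, threshold, window_seconds):
        self.pattern = re.compile(pattern)
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of matching messages

    def observe(self, message, now):
        """Record one log message; return True when the alert should fire."""
        if self.pattern.search(message):
            self._hits.append(now)
        # Drop matches that have fallen out of the time window.
        while self._hits and self._hits[0] <= now - self.window_seconds:
            self._hits.popleft()
        return len(self._hits) > self.threshold
```

The alert is driven by the messages that real traffic produces, which is the difference from a "one time probe".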
  11. Results seen so far
      • 1st level support team is gaining trust in the application.
        – Logs are getting more and more readable
        – Events can be correlated much more easily
      • 2nd level support (OPS) can set alert thresholds and react promptly, having alerts tied to real traffic data and not “one time probes”
      • I have a better feeling for the trend of issues in production, and I don’t have to dig for logs
  12. Instrumented metrics
      PRODUCTION PERFORMANCE
  13. Tools
      • Instrumented metrics
        – JMX > Jolokia > JSON > Graphite
          • MIN / MAX / AVG response time of each API
          • Worst response times with related API parameters
          • Success / failure counters
          • All the above aggregated over the last 5 / 15 minutes, 1 hour, 24 hours
          • Plus all the standard info exposed via JConsole / JMX
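The aggregation described on this slide could look roughly like the sketch below: one object per API keeps raw samples and reports MIN / MAX / AVG plus success and failure counters for any trailing window. It is written in Python for brevity; the real backend exposes such values over JMX. The class `ApiMetrics` and its method names are hypothetical, and a production version would also prune old samples.

```python
from collections import deque


class ApiMetrics:
    """Collect response times for one API and summarise a trailing window."""

    def __init__(self):
        self._samples = deque()  # (timestamp, response_ms, success)

    def record(self, timestamp, response_ms, success):
        self._samples.append((timestamp, response_ms, success))

    def summary(self, now, window_seconds):
        """Return MIN / MAX / AVG and counters for the last `window_seconds`."""
        recent = [s for s in self._samples if now - s[0] <= window_seconds]
        if not recent:
            return None
        times = [ms for _, ms, _ in recent]
        ok = sum(1 for _, _, success in recent if success)
        return {
            "min": min(times),
            "max": max(times),
            "avg": sum(times) / len(times),
            "success": ok,
            "failure": len(recent) - ok,
        }
```

Calling `summary` with 300, 900, 3600 and 86400 seconds gives the 5 minute, 15 minute, 1 hour and 24 hour views from the same samples.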
  14. Why?
      • JMX
        – It is built into Java, and it is non-invasive
          • R&D loves it, because it does not need an invasive agent, unlike many of the profiling agents normally used in such cases. Standard profiling agents tend to interfere with the application and decrease the overall performance.
        – It is a standard, so there are many tools that plug into it natively
      • Jolokia
        – It is a standard tool that plugs into JMX and exposes it in JSON-encoded format
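As a rough sketch of the JMX > Jolokia > JSON > Graphite chain, the function below takes the JSON body of a Jolokia read response and turns it into lines for Graphite's plaintext protocol (`path value timestamp`). It does no network I/O. The sample response follows the shape Jolokia returns for a read of `java.lang:type=Memory` / `HeapMemoryUsage`, but its numbers and the metric prefix `backend01.jvm` are made up for illustration.

```python
import json

# Sample body in the shape of a Jolokia "read" response (values are invented).
SAMPLE_RESPONSE = json.dumps({
    "request": {"mbean": "java.lang:type=Memory",
                "attribute": "HeapMemoryUsage", "type": "read"},
    "value": {"init": 134217728, "committed": 184549376,
              "max": 1908932608, "used": 15348352},
    "timestamp": 1340000000,
    "status": 200,
})


def jolokia_to_graphite(response_text, prefix):
    """Convert one Jolokia read response into Graphite plaintext lines."""
    doc = json.loads(response_text)
    if doc.get("status") != 200:
        raise ValueError(f"Jolokia reported status {doc.get('status')}")
    attribute = doc["request"]["attribute"]
    timestamp = doc["timestamp"]
    value = doc["value"]
    if isinstance(value, dict):
        # Composite attributes come back as a JSON object: one metric per key.
        pairs = [(f"{prefix}.{attribute}.{key}", number)
                 for key, number in sorted(value.items())]
    else:
        pairs = [(f"{prefix}.{attribute}", value)]
    return [f"{path} {number} {timestamp}" for path, number in pairs]
```

Each returned line, terminated by a newline, is what a Graphite carbon listener accepts on its plaintext port.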
  15. Why?
      • Graphite
        – It can correlate data from many sources
        – Gives me the freedom to structure graphs as I want, directly from the web interface
          • This is a definite WIN over Munin or Cacti
        – It lets me select specific timeframes
          • Useful for outage investigation, which is not possible with Munin
        – Can create dashboards
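Selecting a specific timeframe, for example around an outage, maps onto the `from` and `until` parameters of Graphite's render API. The sketch below only builds such a URL. The host `graphite.example.com` and the metric name `backend01.api.getBook.avg` are placeholders, not names from the talk.

```python
from urllib.parse import urlencode


def render_url(base_url, target, start, end):
    """Build a Graphite render URL restricted to the window [start, end].

    `start` and `end` use Graphite's HH:MM_YYYYMMDD absolute time format.
    """
    query = urlencode({
        "target": target,
        "from": start,
        "until": end,
        "format": "json",
    })
    return f"{base_url}/render?{query}"


# Two hours around a hypothetical outage.
OUTAGE_URL = render_url(
    "http://graphite.example.com",
    "backend01.api.getBook.avg",
    "04:00_20120615",
    "06:00_20120615",
)
```

The same call with relative values such as `-24h` for `start` and `now` for `end` gives the usual trailing view.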
  16. Data are in transactions “per 5 minutes” in this graph
      …you can see this specific service is currently being used
  17. 100 transactions per second
      uhmm… at 7 a.m., ok, 11 a.m. in India
      someone is testing…
  18. Results seen so far
      • No need for load and performance testing
        – Apart from specific cases, where we try to reproduce an issue so that DEVs can work on it.
        – Producing a proper load test is problematic, and can lead to false assumptions about the product. Being able to watch what the business logic is doing in production is the best load test.
      • DEVs are proactively watching and fixing performance issues on their own. The overall product gets better and better.
  20. Tools
      • Testing
        – BDD / Cucumber-Nagios executed by Jenkins
          • Cover all the fast HTTP actions via Watir
          • API calls via JsonRPC or Soap4r
          • Javascript-based UI via Selenium / Capybara
      • These tests are actually very valuable at deployment time, since there is no need for manual testing. Everything is in the hands of whoever follows the deployment.
  21. Why?
      • BDD
        – Not everyone wants to read your code
        – Not everyone is a coder
        – You don’t want to have to explain your test again and again and again, and you hate documenting
      • Cucumber-Nagios / Ruby
        – It is off-the-shelf, it works.
        – It generates a standard JUnit XML report
          • Means: it integrates directly with Jenkins (formerly Hudson)
        – It generates an awesome HTML report
        – It can be extended pretty easily
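To show why a JUnit XML report integrates so easily with Jenkins, here is a small sketch of the report shape: a `testsuite` element with counters, and one `testcase` per scenario with an optional `failure` child. It is a hand-rolled Python illustration, not the output of Cucumber-Nagios itself, and the function name `junit_report` is an assumption.

```python
import xml.etree.ElementTree as ET


def junit_report(suite_name, results):
    """Build a JUnit-style XML report.

    `results` is a list of (scenario_name, seconds, failure_message_or_None).
    """
    failures = sum(1 for _, _, failure in results if failure is not None)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for name, seconds, failure in results:
        case = ET.SubElement(
            suite, "testcase",
            classname=suite_name, name=name, time=f"{seconds:.3f}",
        )
        if failure is not None:
            # A failed scenario carries its message in a <failure> child.
            ET.SubElement(case, "failure", message=failure).text = failure
    return ET.tostring(suite, encoding="unicode")
```

Any test runner that writes this structure can be picked up by the CI server's JUnit report publisher, whatever language the tests are written in.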
  22. Why?
      • Watir
        – It is the default HTTP client in Cucumber-Nagios
          • BUT: it has tons of bugs… I have a long backlog to fix
        – It is fast
      • Soap4r
        – Pretty easy SOAP ruby gem/library
      • JsonRPC
        – Very simple and basic JSON RPC gem/library
          • BUT: it does not support proxy settings
  23. Why?
      • Selenium
        – Because it is the only one?
        – It supports Javascript
        – It supports clustering of testing nodes
        – It is supposed to be easy to integrate with Cucumber (it is NOT… I’m working on it)
  24. Upcoming…
      • Health checks (normally used for load balancing purposes) based on historical business logic data from within the instrumented metrics
      • Continuous integration
        – Configuration management
      • Data mining
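One possible reading of "health checks based on historical data from the instrumented metrics" is sketched below: the status a load balancer sees is derived from the success and failure counters of a recent window instead of from a one-off probe. The function name, the 5% threshold and the minimum request count are assumptions made for illustration, not values from the talk.

```python
def health_status(success_count, failure_count,
                  max_failure_ratio=0.05, min_requests=20):
    """Return an HTTP status code for a load balancer health check URL.

    The counters are expected to cover a recent window, e.g. the last 5 minutes.
    """
    total = success_count + failure_count
    if total < min_requests:
        # Too little traffic to judge; keep the node in the pool.
        return 200
    failure_ratio = failure_count / total
    return 200 if failure_ratio <= max_failure_ratio else 503
```

With this approach a node is taken out of rotation because real requests are failing, not because a synthetic ping happened to time out.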
  25. guido.serra@txtr.com
      QUESTIONS?