The document discusses production ready microservices at scale. It outlines a 5 point checklist for building production ready microservices: 1) Code Quality, 2) Testing, 3) Resilience Patterns, 4) Observability, and 5) CI/CD Pipelines. It provides examples for each point from GO-JEK's implementation, such as using tools like Rubocop for code quality, contract testing, circuit breakers for resilience, Grafana/NewRelic for observability, and canaries for stable deployments. The overall goals are to provide availability, stability, reliability, and increased development velocity through standardized microservices.
9. Linting/
Formatting
• Statically analyse linting and
formatting errors
• Follow Ruby Style Guide
• Helps maintain Code sanity
• Use tools like Rubocop
• `rubocop -a` to autofix certain
errors
10.
11. Code Smells
• Statically analyse code for code
smells like Long Parameter list,
Cyclomatic Complexity,
Uncommunicative name etc
• Easy to change codebases
• Use tools like Reek
12.
13. Secure coding
• Statically analyse code for Security
vulnerabilities
• Identify security issues like XSS,
CSRF, SQL Injection etc
• Use tools like Brakeman
• Identify issues with various
confidence levels
14.
15. #1. Code Quality
• Make it part of your development
process
• Also deployment pipeline
• Helps avoid bugs creeping in the
code
• Helps achieve Stability and
Reliablity of microservice
• Also helps achieve security
17. Unit/Integration Testing
• Unit tests test a unit/function (Individual component)
• Integration tests help in testing components working together
• Contract testing helps test interactions of your micro service with its
dependencies
21. Load Testing
• Expose to higher traffic loads to see behaviour
• Makes sure that microservice is ready to accept increased traffic in
future
• Helps identify scalability challenges and bottlenecks
• Helps in determining SLA of a service
• Load testing at GO-JEK using Gatling
22. Chaos Testing
• Helps determine Unknown Unknowns
• Type of Resiliency testing
• Break systems to understand behaviour
• At GO-JEK ? Manual
• Automated ? Simian Army
23. #2. Testing
• Can be automated
• Make it part of your development
process
• Also deployment pipeline
• Builds confidence in code
• Helps with Fault tolerance,
Scalability, Performance, Stability
and Reliability
25. Timeouts/Retries
• Stop waiting for an answer after certain time - Timeout
• Timeout - Fail Fast
• Retry the request after failure
• Retry - Succeed eventually
27. Circuit Breaker
• Circumvent calls to dependency when calls fail
• Fallback to a fallback response
• Hystrix is a well known implementation
• jruby-hystrix used internally at GO-JEK (Yet to be open sourced)
29. module CustomerClient
module Hystrix
class HttpCommand < ::Hystrix::HystrixBaseCommand
SERVER_DOWN_HTTP_STATUS_CODES = [500, 501, 502, 503, 504]
def run
http_response = nil
begin
http_response = request_uri.send(request_method, request_params, request_headers) do|
callback|
callback.on(400..511) do |response|
Wrest.logger.warn("<- #{request_method} #{request_uri}")
Wrest.logger.warn("code: #{response.code}: body: #{response.deserialised_body}")
end
end
if SERVER_DOWN_HTTP_STATUS_CODES.include?(http_response.code.to_i)
raise "Server Error with response code #{http_response.code.to_i}"
end
build_response(http_response)
rescue => e
raise HttpCommandException.new(request_uri, request_headers, request_params,
http_response), e.message
end
end
end
end
end
jruby-hystrix
30. #3. Resiliency
Patterns
• Protect systems from failures of its
dependencies
• Protect dependency when it has
started to fail
• Fail Fast and Succeed Eventually
• Helps with Fault tolerance,
Scalability, Stability and Reliability
32. Logging
• Describes state of the micro service
• Helps debugging root cause of an issue
• Structured logging
• Logging needs to be scalable, available and easily accessible (Log
centralisation)
• Logging in Ruby
33.
34. Instrumentation in Code
• Microservice Key metrics (Latencies, errors, api responses etc) -
newrelic_rpm
• StatsD metrics (statsd-instrument) - Custom metrics
• Exception and error tracking metrics (sentry-raven)
35. StatsD
Initialisation:
`StatsD.backend = case Rails.env
when 'production', 'staging'
StatsD::Instrument::Backends::UDPBackend.new(“local:8125”, :datadog)
when 'test'
StatsD::Instrument::Backends::NullBackend.new
else
StatsD::Instrument::Backends::LoggerBackend.new(StatsD.logger)
end`
Counter:
StatsD.increment(‘customer.login.device.ratelimit.count’)`
Gauge:
StatsD.gauge(‘booking.queued', 12, sample_rate: 1.0)
36. Dashboards
• Capture Key metrics
• USE (Utilisation, Saturation, Errors)
• RED (Request Rate, Error Rate, Duration)
• CPU, RAM, database connections, latencies etc
• Should represent Health of a service
• Grafana/NewRelic dashboards at GO-JEK
40. Alerting
• Alert when failures are detected or changes in key metrics
• Alerts should be actionable
• Answered by people on call/support
• Helps protect availability of systems before they go down
• Alerting at GO-JEK
41. #4. Observability
• Provides insights into how a service
is behaving and its overall health
• Includes 3 main categories:
Logging, Monitoring and
Distributed Tracing
• Helps in quick incidence response
during outages
• Helps maintain Stability of the
service
43. CI/CD
• Increased Deployment Velocity - 30+ deploys a day
• Bad Deployments major cause of downtimes
• Rollout - Staging -> Production
• Rolling Deploys - Incremental
• Think of Rollbacks at every deployment
45. Canary
• Deploy the new change to single node or subset of total nodes
• Monitor key metrics for certain duration
• Promote Canary to Production if all good or else Rollback
• Helps minimise impact of bad deployments in Production
51. More …
• Documentation
• Security
• Capacity Planning
• Outages and Incidence Response (Runbooks)
52. Future work
• Implementing Production Readiness checklist on the ground
• Review process for micro services
• Production Readiness score
• Automating the checklist