3. SRE mission at Mercari
● To ensure a reliable service that is enjoyable to use at any time
● Takes care of all engineering apart from new service development
○ Performance improvement, automation, security, etc.
● Do one thing well
○ Unix philosophy
○ One function in one service, not multiple functions in one service
● Decentralized Governance
○ Each team owns its service
○ Each service can be changed, upgraded, or replaced independently
○ Right framework and tool for each domain
● Software Engineer
○ Don't let velocity stall; make feature improvement iterations faster
○ -> Provide great features to customers faster
○ Provide an automated platform for microservices
○ Give some responsibilities (e.g., deployment, debugging) to software engineers
○ -> Focus more on SRE-related software engineering tasks
25. Why Docker?
● Software engineers get more control
○ They can include what they want (e.g., runtime, library)
● Environmental parity
○ What runs in local development (or the QA env) is exactly the same as in production (easy to debug)
○ No more “it works on my environment but not in production!”
● Easy to deploy
○ Docker image ≒ single statically linked binary
○ You already know its benefit if you use Go
26. Kubernetes (GKE)
● Container orchestration
● Derived from Google's internal systems, Borg and Omega
● Inspired and informed by
Google’s experiences and
27. Why Kubernetes?
● Best way to maximize container benefit
○ Resource isolation and limits let us raise compute resource utilization. But how?
■ K8s schedules containers onto appropriate instances
○ How to communicate between dynamically scheduled containers?
■ K8s provides service discovery
● Reduce operation costs
○ Self healing & auto scaling
● Infrastructure of infrastructure
○ Industry standard https://githubengineering.com/kubernetes-at-github
○ More tools/software will be built on top of k8s in the future (I guess)
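The scheduling point above can be illustrated with a toy example. This is a minimal sketch and not how the Kubernetes scheduler is implemented: the `Node` and `Pod` types and the `schedule` function are hypothetical names, and the first-fit rule stands in for the much richer filtering and scoring a real scheduler does. It only shows why declared resource needs let containers be packed onto instances.

```go
package main

import "fmt"

// Node is a hypothetical machine with free CPU (millicores) and memory (MiB).
type Node struct {
	Name    string
	FreeCPU int
	FreeMem int
}

// Pod is a hypothetical workload that declares the resources it needs.
type Pod struct {
	Name string
	CPU  int
	Mem  int
}

// schedule places each pod on the first node with enough free resources
// (first-fit) and returns pod name -> node name. Pods that fit nowhere are
// left out of the result. The free resources in nodes are updated in place.
func schedule(nodes []Node, pods []Pod) map[string]string {
	placement := make(map[string]string)
	for _, p := range pods {
		for i := range nodes {
			n := &nodes[i]
			if n.FreeCPU >= p.CPU && n.FreeMem >= p.Mem {
				n.FreeCPU -= p.CPU
				n.FreeMem -= p.Mem
				placement[p.Name] = n.Name
				break
			}
		}
	}
	return placement
}

func main() {
	nodes := []Node{{"node-a", 1000, 2048}, {"node-b", 2000, 4096}}
	pods := []Pod{{"api", 800, 1024}, {"worker", 500, 512}, {"cache", 200, 1024}}
	for pod, node := range schedule(nodes, pods) {
		fmt.Printf("%s -> %s\n", pod, node)
	}
}
```

With these made-up numbers, `api` and `cache` share `node-a` while `worker` lands on `node-b`, because `node-a` no longer has enough free CPU for it.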
● gRPC Remote Procedure Call
● High performance, general
purpose, open source,
● Open source version of Stubby, the RPC system used inside Google
● Simple service definition
○ By default, gRPC uses protocol buffers as the Interface Definition Language (IDL) for describing both the service interface and the structure of the payload messages.
● Works across languages and platforms
○ Write a Go server and a Python client
○ Utilize polyglot microservices
30. Why not REST?
● Who can implement REST correctly?
○ High cost to design (Path? Parameters? hah?)
○ In the end, it’s just HTTP endpoints
● No more hand-written HTTP client implementations
● Deployment is key in a microservices platform
○ “Without velocity stalled, rather make iteration speed faster”
● We need an easy and safe automated deployment system
○ We started with chatbot-style deployment, but it did not scale
37. Why Spinnaker?
● Kubernetes support
● Built-in deployment best practice from Netflix and Google
○ Immutable infrastructure
○ Blue/Green deployment, Canary deployment
○ Manual judgement (by manager) phase
○ Run integration tests
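The canary idea listed above can be sketched in a few lines. This is an illustrative assumption and not Spinnaker's API or its canary analysis: `toCanary` and `promote` are hypothetical helpers, and the 10% split and error-rate tolerance are made-up numbers. The sketch sends a small, deterministic share of traffic to the new version and only promotes it when its error rate stays close to the baseline.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// toCanary deterministically sends roughly percent% of request keys to the
// canary version by hashing the key into 100 buckets.
func toCanary(key string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()%100 < percent
}

// promote reports whether the canary's error rate is within tolerance of
// the baseline's error rate.
func promote(baselineErrRate, canaryErrRate, tolerance float64) bool {
	return canaryErrRate <= baselineErrRate+tolerance
}

func main() {
	canary := 0
	for i := 0; i < 10000; i++ {
		if toCanary(fmt.Sprintf("user-%d", i), 10) {
			canary++
		}
	}
	fmt.Printf("canary received %d of 10000 requests\n", canary)
	fmt.Println("promote:", promote(0.010, 0.012, 0.005))
}
```

Hashing the key rather than picking at random keeps a given user on the same version for the whole rollout, which makes canary problems easier to reproduce.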
38. Spinnaker in Mercari
● Currently only for container deployment to Kubernetes
● Each team uses Spinnaker to deploy their own services
● A single Spinnaker instance handles all microservices in all regions
Selection of the metrics service/software is still under discussion and trial
● First-class support for containers and Kubernetes
● Integration with the Kubernetes ecosystem
○ Spinnaker, Istio and so on
● Service dependency visualization
Is testing in microservices hard?
● Focus on unit tests as usual
○ Because each service is supposed to be independent
○ Each microservice must measure test coverage
● Integration tests?
○ Use mocks instead of working hard to prepare a local env
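The mock approach above can look like the following in Go. This is a minimal sketch under assumptions: `UserClient`, `Greeter`, and `mockUsers` are hypothetical names invented for illustration, and a real service would typically mock its generated gRPC client interface in the same way.

```go
package main

import (
	"errors"
	"fmt"
)

// UserClient is a hypothetical interface for a call to another microservice.
type UserClient interface {
	GetName(id int) (string, error)
}

// Greeter is the code under test; it depends only on the interface.
type Greeter struct {
	Users UserClient
}

// Greet builds a greeting and falls back to a generic one if the lookup fails.
func (g Greeter) Greet(id int) string {
	name, err := g.Users.GetName(id)
	if err != nil {
		return "Hello, guest"
	}
	return "Hello, " + name
}

// mockUsers is a hand-written mock that needs no network or local env.
type mockUsers struct {
	names map[int]string
}

// GetName returns the canned name for id, or an error for unknown ids.
func (m mockUsers) GetName(id int) (string, error) {
	name, ok := m.names[id]
	if !ok {
		return "", errors.New("user not found")
	}
	return name, nil
}

func main() {
	g := Greeter{Users: mockUsers{names: map[int]string{1: "alice"}}}
	fmt.Println(g.Greet(1)) // prints "Hello, alice"
	fmt.Println(g.Greet(2)) // prints "Hello, guest"
}
```

Because `Greeter` only sees the interface, the same test exercises both the success path and the failure path without starting the dependent service.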
67. Service mesh
Don’t trust each other!
● Traffic management
○ API rate limit, circuit breaker
● Policy enforcement
○ Ensure access policies (which service can access which service?)
We should achieve the above without modifying client/server code!
70. Chaos engineering
● Real world is hard …
○ Machines crash and networks are unstable (especially in distributed systems)
● A dependent service can fail at any time
71. Chaos engineering
● Services must be fault tolerant whenever something goes wrong
● Emulate real-world problems
○ We need to identify weaknesses
■ Improper fallback settings when a service is unavailable
○ Software engineers should be aware
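The fallback weakness mentioned above can be explored with a small fault-injection wrapper. This is a sketch under assumptions and not a chaos engineering tool: `Fetch`, `withFaults`, and `withFallback` are hypothetical names, and failing every nth call is a deliberately simple, deterministic stand-in for real-world failures.

```go
package main

import (
	"errors"
	"fmt"
)

// Fetch is a hypothetical call to a dependent service.
type Fetch func(itemID string) (string, error)

// withFaults wraps a Fetch and fails every nth call, emulating an
// unreliable dependency. n must be greater than zero.
func withFaults(next Fetch, n int) Fetch {
	calls := 0
	return func(itemID string) (string, error) {
		calls++
		if calls%n == 0 {
			return "", errors.New("injected fault")
		}
		return next(itemID)
	}
}

// withFallback returns a default value when the wrapped call fails.
func withFallback(next Fetch, fallback string) Fetch {
	return func(itemID string) (string, error) {
		v, err := next(itemID)
		if err != nil {
			return fallback, nil
		}
		return v, nil
	}
}

func main() {
	lookup := func(itemID string) (string, error) { return "price of " + itemID, nil }
	fetch := withFallback(withFaults(lookup, 3), "price unavailable")
	for i := 1; i <= 3; i++ {
		v, _ := fetch("item-1")
		fmt.Println(v)
	}
}
```

The third call hits the injected fault and prints `price unavailable` instead of an error. Removing `withFallback` from the chain shows what a caller with no fallback configured would see.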