Service Mesh vs. Frameworks:
Where to put the resilience?
Michael Hofmann
https://hofmann-itconsulting.de
Agenda
(1) Distributed Systems and Resilience
(2) Framework
(3) Service Mesh
(4) Framework and Service Mesh Characteristics
(5) Thoughts about Resilience
(6) Essential Requirements
(7) Conclusion
Distributed Systems
➔ degree of distribution raises failure rate!
➔ compensation strategy: resilience!
Typical Communication Errors
- slow response
- timeout
- aborted network connection
- ...
Fallacies of Distributed Computing
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
Resilience
- Hystrix: framework "DEATH"
- Resilience4J, Failsafe, MicroProfile Fault Tolerance, …: framework ACTIVE
Alternative: Service Mesh?!
Resilience Patterns
- Timeout
- Retry
- Fallback
- Circuit Breaker
- Bulkhead
Many more: Uwe Friedrichsen, "Patterns of resilience", https://www.slideshare.net/ufried/patterns-of-resilience
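To make the Circuit Breaker pattern from the list above concrete, here is a minimal single-threaded sketch in plain Java. The class `SimpleCircuitBreaker` and its injected clock are hypothetical and exist only for illustration. The parameter names are borrowed from MicroProfile Fault Tolerance's `@CircuitBreaker` annotation, but the logic is a simplified reading of the pattern, not that library's implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, single-threaded circuit breaker sketch; not a library API. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold; // size of the rolling window
    private final double failureRatio;        // failure share that trips the breaker
    private final long delayMillis;           // how long the circuit stays open
    private final int successThreshold;       // trial successes needed to close again
    private final LongSupplier clock;         // injected so tests can control time

    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;
    private int trialSuccesses;

    SimpleCircuitBreaker(int requestVolumeThreshold, double failureRatio,
                         long delayMillis, int successThreshold, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.failureRatio = failureRatio;
        this.delayMillis = delayMillis;
        this.successThreshold = successThreshold;
        this.clock = clock;
    }

    /** Current state; an open circuit becomes half-open once the delay has passed. */
    State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= delayMillis) {
            state = State.HALF_OPEN;
            trialSuccesses = 0;
        }
        return state;
    }

    /** Runs the action unless the circuit is open, and records the outcome. */
    <T> T call(Supplier<T> action) {
        if (state() == State.OPEN) {
            throw new IllegalStateException("circuit open");
        }
        try {
            T result = action.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            throw e;
        }
    }

    private void record(boolean failed) {
        if (state == State.HALF_OPEN) {
            if (failed) {
                open(); // any failed trial call reopens the circuit
            } else if (++trialSuccesses >= successThreshold) {
                state = State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failed);
        if (window.size() > requestVolumeThreshold) {
            window.removeFirst();
        }
        if (window.size() == requestVolumeThreshold) {
            long failures = window.stream().filter(f -> f).count();
            if ((double) failures / requestVolumeThreshold >= failureRatio) {
                open();
            }
        }
    }

    private void open() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
        window.clear();
    }
}
```

With `requestVolumeThreshold = 4` and `failureRatio = 0.5`, two failures among the last four calls open the circuit in this sketch. Callers then fail fast until the delay has elapsed.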
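MicroProfile Fault Tolerance
    import org.eclipse.microprofile.faulttolerance.CircuitBreaker;
    import org.eclipse.microprofile.faulttolerance.Fallback;
    import org.eclipse.microprofile.faulttolerance.Retry;

    // Opens the circuit when at least 2 of the last 4 calls fail (failureRatio 0.5),
    // stays open for 1000 ms, then closes again after 10 consecutive successful trial calls.
    @CircuitBreaker(successThreshold = 10, requestVolumeThreshold = 4,
                    failureRatio = 0.5, delay = 1000)
    public Connection serviceA() {
        counterForInvokingServiceA++;
        return connectionService();
    }

    // Up to 3 retries; doFallback() is used once all attempts have failed.
    @Retry(maxRetries = 3)
    @Fallback(fallbackMethod = "doFallback")
    public Result doWork() {
        return callServiceA(); // fallback on RuntimeException
    }

    private Result doFallback() {
        return ...;
    }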
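As a rough picture of what combining `@Retry(maxRetries = 3)` with `@Fallback` amounts to, the following plain-Java sketch spells out the control flow by hand. `RetryWithFallback` is a hypothetical helper written for this illustration and is not part of MicroProfile; the real annotations support further options (delays, jitter, exception filters) that are left out here.

```java
import java.util.function.Supplier;

/** Hypothetical helper, not part of MicroProfile: retry first, fall back last. */
final class RetryWithFallback {
    private RetryWithFallback() {}

    static <T> T call(Supplier<T> action, int maxRetries, Supplier<T> fallback) {
        // One initial attempt plus up to maxRetries further attempts.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                // Ignore the failure and try again while attempts remain.
            }
        }
        // All attempts failed: hand back the fallback result instead.
        return fallback.get();
    }
}
```

Note that `maxRetries = 3` means up to four invocations in total, which matters when estimating the load retries put on a struggling service.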
Service Mesh
"The term service mesh is used to describe the network of microservices that make up such applications and the interactions between them." (istio.io)
Don’t manage a Service Mesh without tooling!
Requirements:
(1) manage calls on layer 7 (application layer, L7)
(2) resilience, routing, security and telemetry
(3) decentralized & transparent for services (implementation independent)
Istio Architecture
Resilience Patterns in Istio
✔ Timeout
✔ Retry
✔ Circuit Breaker
✔ Bulkhead
✗ Fallback?
✗ is a Fallback possible?
✗ less technical, more business driven
https://dzone.com/articles/fallbacks-are-overrated-architecting-for-resilienc
Resilience in Istio
$ kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
    retries:
      attempts: 3
      perTryTimeout: 2s
EOF
$ kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutiveErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
EOF
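The `connectionPool` limits in the DestinationRule cap how much traffic the sidecar lets through at once, which is the bulkhead idea from the pattern list. For readers who think in code, here is a minimal in-process sketch of that idea in Java. `SimpleBulkhead` is a hypothetical class for illustration only; it does not describe how the Istio sidecar enforces its limits.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical bulkhead sketch: at most N calls run at once, the rest are rejected. */
final class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> action) {
        // Reject immediately instead of queueing when no permit is free.
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return action.get();
        } finally {
            permits.release(); // always free the slot, even when the call fails
        }
    }
}
```

Rejecting instead of queueing keeps a slow dependency from tying up all caller threads, at the cost of failing some requests early.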
Resilience in Istio
Apply to sidecar
Resilience rules
- transparent for service
- act globally on all sidecars
Fault Injection
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings
  ...
spec:
  hosts:
  - ratings
  http:
  - fault:
      delay:
        fixedDelay: 7s
        percent: 100
MicroProfile with Istio setting
MP_Fault_Tolerance_NonFallback_Enabled = false
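An injected delay like the 7s above only tests anything if some layer bounds the wait and reacts. In the setup on this slide the mesh would enforce the timeout and the service would keep only its fallback. As a rough in-process picture of that combined behavior, here is a Java sketch; `TimeoutWithFallback` is a hypothetical helper for illustration and is neither Istio nor MicroProfile code.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper: bound the wait for a call, then use a fallback result. */
final class TimeoutWithFallback {
    private TimeoutWithFallback() {}

    static <T> T call(Supplier<T> action, Duration timeout, Supplier<T> fallback) {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(action);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Too slow or failed: stop waiting; the worker may still finish in the background.
            future.cancel(true);
            return fallback.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fallback.get();
        }
    }
}
```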
Frameworks Characteristics
- Java: a lot of different frameworks
- Team decides framework?!?
- Learning curve for every framework
- Different frameworks behave differently
- Different versions of the same framework behave differently
- Different versions of the same framework in use in parallel
Frameworks Characteristics
➔ Change of framework:
➔ Replace every usage in the code
➔ New behavior
➔ New deployment
➔ New tests
➔ Risk of chain reaction: framework ➔ load balancing ➔ service registry
➔ Multiple service registries for every different framework?
Service Mesh Characteristics
- Define new rule
- Same behavior (… no framework change)
- Unchanged deployed service
- New tests only for new rules
- Client-side load balancing in sidecar
- Service registry based on endpoints in K8S
$ kubectl apply -f ...
Thoughts about Resilience
Is a resilience pattern still correct if the communication behavior changes?
- Modified behavior of partner
- Modified communication partner
- Modified infrastructure
- Load changes during the day
- Side effects from other systems
- Anticipate problems of tomorrow?
Thoughts about Resilience
- Main problem: choosing the right resilience pattern
- Correct parameters for the pattern?
- Measure resilience
- Mostly: trial and error for suitable pattern/params (main reason for the end of life of Hystrix)
- Often: retry storm
- Often: missing musketeer principle, "all for one, one for all" (black sheep)
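The retry storm mentioned above can be estimated with simple arithmetic: if every hop in a call chain retries independently, the worst-case number of calls reaching the last service grows exponentially with the chain length. The sketch below computes that bound in Java; `RetryStorm` and its parameters are hypothetical names for illustration, and the model assumes that every attempt at every hop fails.

```java
/** Hypothetical worst-case estimate of retry amplification along a call chain. */
final class RetryStorm {
    private RetryStorm() {}

    /**
     * Calls that can reach the last service for one user request when each of
     * the given hops makes one attempt plus retriesPerHop retries.
     */
    static long worstCaseCalls(int retriesPerHop, int hops) {
        long calls = 1;
        for (int i = 0; i < hops; i++) {
            // Each hop multiplies the traffic by its total number of attempts.
            calls = Math.multiplyExact(calls, 1L + retriesPerHop);
        }
        return calls;
    }
}
```

For a chain A ➔ B ➔ C ➔ D with 3 retries per hop, `RetryStorm.worstCaseCalls(3, 3)` gives 64 calls against D for a single request to A.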
Essential Requirements
- Modification: quick and easy change of
  (1) params for chosen pattern
  (2) resilience pattern
- Test
- Monitoring
- No black sheep
Essential Requirements
               Istio                       Framework
Modification   $ kubectl apply -f ...      + Modify Params
                                           - Change Pattern: Lifecycle
Test           Fault Injection             complicated
Monitoring     +                           +
Black Sheep    No: rule in all sidecars
Conclusion
- Comparable resilience patterns
- Missing fallback in service mesh (but overrated)
- Higher flexibility in service mesh
- Fault injection easy in service mesh
Solve problems where they arise!
Service Mesh for L4-L7
Developer for L8 (original profession)
