RESILIENT ARCHITECTURE
Matt Stine (@mstine)
http://www.mattstine.com
HEADLINES
A SYSTEM FAILURE COSTS A WELL-KNOWN RETAILER SIGNIFICANT REVENUE ON THE BIGGEST INTERNET SHOPPING DAY OF THE YEAR.
A SYSTEM FAILURE CAUSES THE CANCELLATION OF HUNDREDS OF FLIGHTS, STRANDING THOUSANDS OF AIRLINE PASSENGERS, AND ULTIMATELY COSTING THE AIRLINE MILLIONS IN REVENUE.
A BEAUTIFULLY DESIGNED ONLINE STORE CRUMBLES UNDER THE PRESSURE OF A THUNDERING HERD OF CUSTOMERS TRYING TO PURCHASE THE LATEST TECH GADGET.
A SECURITY BREACH EXPOSES THOUSANDS OF CUSTOMER CREDIT CARD NUMBERS, LEADING TO MILLIONS IN LOST REVENUE DUE TO THE RESULTING LOSS OF TRUST.
WHAT CAN WE DO?
DISRUPTIVE COMPANIES ARE ALSO APPROACHING RESILIENCY DIFFERENTLY.
STOP TRYING TO PREVENT MISTAKES.
EMBRACE FAILURE.
FROM MTBF (MEAN TIME BETWEEN FAILURES) TO MTTR (MEAN TIME TO RECOVERY)
WE NEED BETTER TOOLS AND TECHNIQUES.
RESILIENT ARCHITECTURES
Enhance Observability
Leverage Resiliency Patterns
Embrace Chaos
ENHANCE OBSERVABILITY
SEE FAILURE WHEN IT HAPPENS
MEASURE EVERYTHING
WHAT IS NORMAL?
Values
Rates of Change
Mean?
P95/99/99.9?
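The question of mean versus percentiles can be made concrete with a small sketch. The class `LatencyStats`, its method names, and the sample numbers below are made up for illustration. The sketch uses the nearest-rank definition of a percentile, which is only one of several common definitions.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of any library named in these slides.
final class LatencyStats {

    // Arithmetic mean of the samples, or NaN when there are none.
    static double mean(long[] samplesMillis) {
        return Arrays.stream(samplesMillis).average().orElse(Double.NaN);
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0 || p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Assumed data: 95 fast requests (10 ms) and 5 slow ones (2000 ms).
        long[] samples = new long[100];
        Arrays.fill(samples, 0, 95, 10L);
        Arrays.fill(samples, 95, 100, 2000L);

        System.out.println("mean = " + mean(samples));           // mean = 109.5
        System.out.println("p95 = " + percentile(samples, 95));  // p95 = 10
        System.out.println("p99 = " + percentile(samples, 99));  // p99 = 2000
    }
}
```

With this made-up data the mean (109.5 ms) matches no request that actually happened, while p99 exposes the slow tail that the mean and p95 both hide.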
WHAT IS NORMAL?
http://bravenewgeek.com/everything-you-know-about-latency-is-wrong/
SPRING BOOT HEALTH ENDPOINT
{
  "diskSpace": {
    "status": "UP",
    "total": 1056858112,
    "free": 878850048,
    "threshold": 10485760
  },
  "refreshScope": {
    "status": "UP"
  },
  "configServer": {
    "status": "UP",
    "propertySources": [
      "configClient",
      "https://github.com/spring-cloud-services-samples/fortune-teller/configuration/application.yml"
    ]
  },
  "hystrix": {
SPRING BOOT INFO ENDPOINT
"git": {
  "build": {
    "host": "Matts-MacBook-Pro.local",
    "version": "0.0.1-SNAPSHOT",
    "time": 1489021333000,
    "user": {
      "name": "Matt Stine",
      "email": "mstine@pivotal.io"
    }
  },
  "branch": "master",
  "commit": {
    "message": {
      "short": "initial commit",
      "full": "initial commit"
    },
    "id": "9b624974e417693cf921b9abc50b5af4ea0b6dde",
    "id.describe-short": "9b62497-dirty",
    "id.abbrev": "9b62497",
    "id.describe": "9b62497-dirty",
DISTRIBUTED TRACING
Zipkin
EXAMPLES:
Spring Boot Actuator
http://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#production-ready
PCF Apps Manager
https://docs.pivotal.io/pivotalcf/1-9/console/using-actuators.html
Spring Cloud Sleuth
https://cloud.spring.io/spring-cloud-sleuth/
Zipkin
http://zipkin.io/
LEVERAGE RESILIENCY PATTERNS
TIMEOUTS
Thinking is half the battle!
Anything that blocks threads
Any method call with an optional timeout argument
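As a rough illustration of "any method call with an optional timeout argument", the JDK's `Future.get` has an overload that takes a timeout. The sketch below bounds a blocking call that way and falls back to a default value when the wait expires. The class `TimeoutDemo`, the helper `callWithTimeout`, and the simulated slow dependency are assumptions made for this example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only.
final class TimeoutDemo {

    // Runs the task on the executor and waits at most timeoutMillis for it.
    // Returns the fallback if the task is too slow, fails, or the wait is interrupted.
    static <T> T callWithTimeout(ExecutorService executor, Callable<T> task,
                                 long timeoutMillis, T fallback) {
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it does not stay blocked
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            String fast = callWithTimeout(executor, () -> "things", 1000, "default things");
            String slow = callWithTimeout(executor, () -> {
                Thread.sleep(5_000); // simulated slow dependency
                return "things";
            }, 100, "default things");
            System.out.println(fast); // things
            System.out.println(slow); // default things
        } finally {
            executor.shutdownNow();
        }
    }
}
```

The caller waits 100 ms at most for the slow call instead of five seconds, which is the point of the pattern: the waiting thread is released on a schedule the caller chooses.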
ADDING TIMEOUTS TO RESTTEMPLATE
@Bean
public RestTemplate restTemplate() {
    SimpleClientHttpRequestFactory clientHttpRequestFactory
            = new SimpleClientHttpRequestFactory();
    clientHttpRequestFactory.setConnectTimeout(10 * 1000); // Ten seconds!
    clientHttpRequestFactory.setReadTimeout(10 * 1000); // Ten seconds!
    return new RestTemplate(clientHttpRequestFactory);
}
RETRIES
Potentially transient failures
Immediately
With a backoff
Up to a maximum number of times
Log all the things
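The points above can be sketched in plain Java without any library. This is not how Spring Retry is implemented. `SimpleRetry`, its parameters, and the flaky supplier in `main` are hypothetical names chosen for the example. The sketch retries up to a maximum number of attempts, multiplies the delay after each failure up to a cap, and logs every failed attempt.

```java
import java.util.function.Supplier;
import java.util.logging.Logger;

// Hypothetical helper for illustration; Spring Retry offers the same idea declaratively.
final class SimpleRetry {
    private static final Logger logger = Logger.getLogger(SimpleRetry.class.getName());

    // Calls the task up to maxAttempts times. After each failure it sleeps,
    // starting at initialDelayMillis and multiplying the delay each time,
    // capped at maxDelayMillis. The last failure is rethrown to the caller.
    static <T> T retry(Supplier<T> task, int maxAttempts, long initialDelayMillis,
                       double multiplier, long maxDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                // Log all the things: every failed attempt is recorded.
                logger.warning("Attempt " + attempt + " of " + maxAttempts + " failed: " + e);
                if (attempt >= maxAttempts) {
                    throw e; // give up
                }
            }
            sleep(delay);
            delay = Math.min((long) (delay * multiplier), maxDelayMillis);
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Assumed flaky dependency: fails twice, then succeeds.
        String things = retry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "things";
        }, 5, 100L, 2.0, 1000L);
        System.out.println(things + " after " + calls[0] + " attempts"); // things after 3 attempts
    }
}
```

The sketch retries on every `RuntimeException`. Real code would usually retry only failures it believes are transient and let the rest propagate immediately.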
SIMPLE RETRY
@RequestMapping("/acquireThings")
@Retryable
public ResponseEntity<String> tryToAcquireThings() {
    logger.info("Attempting to acquire things...");
    String things = restTemplate
            .getForObject("http://localhost:8081/things", String.class);
    return new ResponseEntity<>(things, HttpStatus.OK);
}

@Recover
public ResponseEntity<String> recover() {
    logger.warn("Returning default response...");
    return new ResponseEntity<>("default things", HttpStatus.OK);
}
RETRY WITH BACKOFF
@RequestMapping("/acquireThings")
@Retryable(maxAttempts = 5,
        backoff = @Backoff(delay = 100L, maxDelay = 1000L,
                multiplier = 2, random = true)
)
public ResponseEntity<String> tryToAcquireThings() {
    logger.info("Attempting to acquire things...");
    String things = restTemplate
            .getForObject("http://localhost:8081/things", String.class);
    return new ResponseEntity<>(things, HttpStatus.OK);
}
EXPONENTIAL BACKOFF
@Bean
public BackOffPolicy backOffPolicy() {
    return new ExponentialBackOffPolicy();
}
BULKHEADS
Microservices
Thread Pools
Availability Zones
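At the level of a single process, a bulkhead caps how much of a shared resource one dependency may consume. The sketch below uses a semaphore rather than a dedicated thread pool, which is a simpler variant of the same idea. The class `Bulkhead` and its `execute` method are hypothetical names for this example, not an API from Hystrix or Spring.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical semaphore-based bulkhead for illustration only.
final class Bulkhead {
    private final Semaphore permits;

    Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a permit is free. Otherwise returns the fallback at once
    // instead of queueing, so a slow dependency cannot tie up every caller.
    <T> T execute(Supplier<T> call, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get();
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        Bulkhead bulkhead = new Bulkhead(1);
        // The outer call holds the only permit, so the nested call is rejected.
        String nested = bulkhead.execute(
                () -> bulkhead.execute(() -> "inner result", () -> "rejected"),
                () -> "outer rejected");
        System.out.println(nested); // rejected
        // The permit was released afterwards, so a later call goes through.
        System.out.println(bulkhead.execute(() -> "ok", () -> "rejected")); // ok
    }
}
```

Giving each downstream dependency its own `Bulkhead` instance (or its own thread pool) means that saturating one of them leaves capacity for the others.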
CIRCUIT BREAKERS
SPRING CLOUD HYSTRIX
@HystrixCommand(fallbackMethod = "fallbackFortune")
public Fortune randomFortune() {
    return restTemplate.getForObject("http://fortunes/random", Fortune.class);
}

private Fortune fallbackFortune() {
    return new Fortune(42L, fortuneProperties.getFallbackFortune());
}
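To show what a circuit breaker does behind an annotation like the one above, here is a minimal state machine in plain Java. It is a teaching sketch, not Hystrix's implementation: Hystrix trips on an error percentage over a rolling window, whereas this version simply counts consecutive failures. `SimpleCircuitBreaker`, its constructor parameters, and the injected clock are assumptions for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical minimal circuit breaker for illustration only.
// Simplification: under concurrency, several trial calls may run while half open.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before opening
    private final long openMillis;      // how long to fail fast before a trial call
    private final LongSupplier clock;   // current time in milliseconds (injectable for tests)

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    // Runs the action unless the circuit is open, in which case the fallback
    // is returned without touching the dependency at all.
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (shouldShortCircuit()) {
            return fallback.get();
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized boolean shouldShortCircuit() {
        if (state != State.OPEN) {
            return false;
        }
        if (clock.getAsLong() - openedAt < openMillis) {
            return true; // still cooling down: fail fast
        }
        state = State.HALF_OPEN; // cool-down elapsed: allow a trial call
        return false;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

A caller might construct it as `new SimpleCircuitBreaker(5, 10_000L, System::currentTimeMillis)` and wrap each remote call in `call(...)`. While the circuit is open the failing dependency gets no traffic, which gives it room to recover.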
EXAMPLES:
Spring Retry
https://github.com/spring-projects/spring-retry
Hystrix
https://github.com/Netflix/Hystrix
via Spring Cloud Netflix
https://cloud.spring.io/spring-cloud-netflix/
EMBRACE CHAOS
HOW DO YOU KNOW YOUR SYSTEM WILL TOLERATE FAILURE IF IT HASN'T FAILED?
GAME DAY EXERCISES
CAN WE DIAL THAT UP A NOTCH?
YAU AND CHEUNG: DESIGN OF SELF-CHECKING SOFTWARE (1975)
DID SOMEBODY SAY...
EXAMPLES:
Chaos Lemur (BOSH)
https://github.com/strepsirrhini-army/chaos-lemur
Chaos Loris (CF)
https://github.com/strepsirrhini-army/chaos-loris
REVIEW TIME!
Stop trying to prevent mistakes
Focus on MTTR
Enhance observability
Leverage resiliency patterns
Embrace chaos!
THANKS!
Matt Stine (@mstine)
http://www.mattstine.com
