3. We must do things differently when building
cost-effective, failure-resilient solutions
Why cloud apps?
             Past                                 Present
Clients      Enterprise/Intranet                  Public/Internet
Demand       Stable (small)                       Dynamic (small → massive)
Datacenter   Single tenant                        Multi-tenant
Operations   People (expensive)                   Automation (cheap)
Scale        Up via few reliable (expensive) PCs  Out via lots of (cheap) commodity PCs
Failure      Unlikely but possible                Very likely
Machine loss Catastrophic                         Normal (no big deal)
Examples       Past                           Present
Exceptions     Catch, swallow & keep running  Crash & restart
Communication  In order, exactly once         Out of order; clients must retry & servers must be idempotent
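Because requests can be delivered out of order and more than once, a retried request must be safe to replay. A minimal sketch of server-side idempotency via a client-supplied request ID (the names are hypothetical, and a real service would persist the dedupe table rather than hold it in memory):

```python
# Sketch: an idempotent "charge" handler. The client sends a unique
# request_id with every attempt; a retry with the same id is detected
# and the original result is returned instead of re-applying the work.
processed = {}  # request_id -> result (a real service would persist this)

def charge(request_id, amount):
    if request_id in processed:      # duplicate delivery: replay prior result
        return processed[request_id]
    result = {"charged": amount}     # do the real work exactly once
    processed[request_id] = result
    return result
```

With this shape, the client can retry freely: a duplicate `charge` never double-bills.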
4. Some reasons why a service instance may fail (stop)
Developer: Unhandled exception
DevOps: Scaling the number of service instances down
DevOps: Updating service code to a new version
Orchestrator: Moving service code from one machine to another
Force majeure: Hardware failure (power supply, fans [overheating], hard disk,
network controller, router, bad network cable, etc.)
Force majeure: Data center outages (natural disasters, attacks)
Since failure is inevitable & unavoidable, embrace it
Architect assuming failures will happen
Operate services using infrastructure that avoids single points of failure
Run multiple instances of services, replicate data, etc.
Cloud computing is all about embracing failure
6. E-Commerce Application
Applications consist of many (micro)services
[Diagram: a Load Balancer in front of Web Site #1-#3, which call the Inventory #1-#2 & Orders #1-#4 service instances]
Each service solves a domain-specific problem & has exclusive access to its own data store
7. 4 reasons to split a monolith into microservices
[Diagram: a Photo Share Service & a Thumbnail Service illustrating the reasons: scaling each service's instances independently; conflicting library versions (Photo Share Service with SharedLib-v1 vs Thumbnail Service with SharedLib-v7); different technology stacks (node.js vs .NET); independent versioning & reuse (Thumbnail Service V1 & V2 running side by side; a Video Share Service (V1) also calling the Thumbnail Service)]
Backward compatibility must be maintained
8. Myth: Microservices offer small, easy-to-understand/manage code bases
A monolith can use OOP & libraries to the same effect (requires developer discipline)
Library changes cause build failures (not runtime failures)
Myth: A failing service doesn't impact other services
Many services require their dependencies to be fully functioning
Hard to write/test code that gracefully recovers when a dependency fails
Myth: We run multiple service instances, so there is no such thing as "failure"
A monolith is up or down completely; no recovery code is needed
Infrastructure restarts failed instances, keeping them up
Microservice architecture benefit myths
9. Composing SLAs for dependent services
When Service-A depends on Service-B, which depends on Service-C, which depends on Service-D, the composed SLA is the product of the individual SLAs (downtime shown per month):

              Service-A        A+B              A+B+C            A+B+C+D
99.995% each  99.995% (132s)   99.990% (264s)   99.985% (396s)   99.980% (528s)
99.999% each  99.999% ( 26s)   99.998% ( 52s)   99.997% ( 78s)   99.996% (104s)
What about the network’s SLA?
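The composed figures follow from multiplying the individual availabilities. A small sketch of the arithmetic (using an average-length month of ~30.44 days, so results may differ by a second or two from the slide's rounding):

```python
# Composed SLA of serially dependent services = product of their
# individual availabilities. Downtime is computed per average month.
SECONDS_PER_MONTH = 365.25 * 24 * 3600 / 12   # ~2,629,800 s

def composed_sla(*slas):
    a = 1.0
    for s in slas:
        a *= s                                 # each dependency multiplies risk
    return a

def downtime_seconds(sla):
    return SECONDS_PER_MONTH * (1 - sla)

# Four services at 99.995% each compose to ~99.980% overall
overall = composed_sla(0.99995, 0.99995, 0.99995, 0.99995)
```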
11. 1. Single root repo; don’t share code with another service
2. Deploy dependent libs with service
3. No config in code; read from environment vars
4. Handle unresponsive service dependencies robustly
5. Strictly separate build, release, & run steps
Build: Builds a version of the code repo & gathers dependencies
Release: Combines a build with config to produce an immutable ReleaseId
Run: Runs service in execution environment
12-factor services (1-5)
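Factor 3 ("no config in code; read from environment vars") can be sketched as follows; `DB_URL` and `MAX_WORKERS` are hypothetical variable names chosen for illustration:

```python
import os

# Factor 3: configuration comes from the environment, never the code base,
# so the same build can run in dev, staging, & prod.
def load_config():
    return {
        "db_url": os.environ.get("DB_URL", "sqlite:///dev.db"),  # local-dev fallback
        "max_workers": int(os.environ.get("MAX_WORKERS", "4")),
    }
```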
12. 6. Service is 1+ stateless processes & shares nothing
7. Service listens on ports; avoid using (web) hosts
8. Use processes for isolation; multiple for concurrency
9. Processes can crash/be killed quickly & start fast
10. Keep dev, staging, & prod environments similar
11. Log to stdout (dev=console; prod=file & archive it)
12. Deploy & run admin tasks (scripts) as processes
12-factor services (6-12)
14. 8 fallacies of distributed computing
http://www.rgoarchitects.com/Files/fallacies.pdf
Fallacy                       Effect
The network is reliable       App needs error handling/retry
Latency is zero               App must restrict its traffic
Bandwidth is infinite         App must restrict its traffic
The network is secure         App must secure its data/authenticate servers
Topology doesn't change       Changes affect latency & bandwidth
There is one administrator    Changes affect ability to reach destination
Transport cost is zero        Costs must be budgeted
The network is homogeneous    Affects reliability, latency, & bandwidth
15. We run multiple instances of a service
For service failure/recovery & scale up/down
So, instances’ endpoints dynamically change over the service’s lifetime
Ideally, we’d like to abstract this from client code
Each client wants a single stable endpoint as the face of the
dynamically-changing service instance endpoints
Typically, this is accomplished via a reverse proxy
NOTE: Every request goes through the RP; causes an extra network hop
We’re losing some performance to gain a lot of benefits
Client uses DNS (at well-known static endpoint) to get RP’s stable endpoint
DNS endpoints are usually cached & re-resolved infrequently
Service high-availability & scalability
17. Cluster DNS & service reverse proxy
[Diagram: clients resolve the reverse proxy's stable endpoint via cluster DNS; the Load Balancer routes to Web Site #1-#3, which call Inventory #1-#3 & Orders #1-#2 instances]
⚠ Web Site #1 could fail before Inventory #3 replies
18. Comparing an in-process call to a network request
Performance: Worse, increases network congestion, unpredictable
Unreliable: Requires retry loops with exponential backoff/circuit breakers
Server code must be idempotent
Security: Requires authentication, authorization, & encryption
Diagnostics: network issues, perf counters/events/logs, causality/call stacks
Tooling: lose IDE support (IntelliSense, refactoring & compile-time type-safety)
Turning a monolith into a microservice
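The retry loop above can be sketched with exponential backoff plus "full jitter"; `call_with_retries` is a hypothetical helper, and a production client would also wrap the dependency in a circuit breaker. Note this is safe only because the server is idempotent: a timed-out attempt may actually have succeeded.

```python
import random
import time

# Sketch: retry an unreliable network call with exponential backoff + jitter.
def call_with_retries(op, attempts=5, base_delay=0.05, max_delay=2.0):
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise                                  # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))       # "full jitter" spreads out retries
```

Jitter matters because many clients retrying on the same schedule can re-overload a recovering server.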
19. [Diagram repeated from slide 7: 4 reasons to split a monolith into microservices]
20. Define explicit, formal cross-language API/data contracts
“Contracts” defined via code do not work; do not do this
Ex: DateTime can be null in Java but not in .NET
Use cross-language data transfer formats
Ex: JSON/XML, Avro, Protocol Buffers, FlatBuffers, Thrift, Bond, etc.
Consider embedding a version number in the data structure
Optional: (De)serialize data into language-specific types
Beware of RAM/CPU costs with this; keep types “disposable” (not contracts)
Defining network API contracts
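An illustrative (hypothetical) versioned JSON contract: the embedded version number lets a reader reject payloads it does not understand, and rich values like dates travel as documented strings rather than language-specific types:

```python
import json

# Sketch of an explicit, cross-language JSON contract; field names are
# hypothetical. Rich values (dates, GUIDs) are strings with a documented format.
def serialize_order(order_id, placed_at_iso):
    return json.dumps({"version": 1, "orderId": order_id,
                       "placedAt": placed_at_iso})

def deserialize_order(payload):
    doc = json.loads(payload)
    if doc.get("version") != 1:      # reject unknown contract versions
        raise ValueError("unsupported contract version")
    return doc
```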
21. Technologies try to map a method call to a network request
Examples: RPC, RMI, CORBA, DCOM, WCF, etc.
These frequently don’t work well due to
Network fallacies (lack of retry/circuit breaker)
Language-specific data type conversions (ex: dates, times, durations)
Versioning: Which version to call on the server?
Authentication: expiring tokens
Logging: Log request parameters/headers/payload, reply headers/payload?
Beware leaky RPC-like abstractions
23. The request/reply pattern is frequently not the best
Client sends to server but selected server may be busy; other server may be idle
Client may crash/scale down/reconfigure while waiting for server’s reply
So, consider messaging communication instead
Resource efficient
Client doesn’t wait for server reply (no blocked threads/long-lived locks)
Idle consumers pull work vs. a busy consumer being pushed more work
Consumers don’t need listening endpoints; producers talk to queue service
Resilient: Producer/consumer instances can come, go, and move at will
If a consumer fails, another consumer processes the message (at-least-once delivery, not ordered)
Consumers/producers can be offline without message loss
Elastic: Use queue length to determine need to scale up/down
Messaging communication
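The pull model above can be sketched with an in-process queue standing in for a queue service (in production the queue is durable, delivery is at-least-once, and producers & consumers live in separate processes):

```python
import queue
import threading

# Sketch: producers post to a queue; idle consumers *pull* the next message
# when ready, so a busy consumer is never pushed more than it can handle.
# Consumer instances can come & go; messages wait in the queue meanwhile.
q = queue.Queue()
results = []

def consumer():
    while True:
        msg = q.get()
        if msg is None:              # sentinel: this worker shuts down
            break
        results.append(msg * 2)      # "process" the message

workers = [threading.Thread(target=consumer) for _ in range(3)]
for w in workers:
    w.start()
for i in range(10):                  # producer never blocks awaiting replies
    q.put(i)
for _ in workers:                    # one shutdown sentinel per worker
    q.put(None)
for w in workers:
    w.join()
```

Note the producer neither knows nor cares how many consumers exist — the basis of the elasticity point above.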
24. Messaging with queues
[Diagram: a Load Balancer in front of WebSite #1-#3; Service-A #1-#3 & Service-B #1-#2 pull messages from queues (Q-A, Q-WS1)]
🛈 Request/reply isn't required; Service-B #1 could post to Q-WS1, not to Q-A
🛈 All Service-A instances could go down, but not WebSite #1
26. Building reliable & scalable services that manage state is
substantially harder than building stateless services
Due to data size/speed, partitioning, replication, consistency, disaster recovery,
backup/restore, costs, administration, security, etc.
Because of this, most devs do not build their own stateful
services; they use a robust/hardened service instead
When selecting a stateful service, you must fully understand your service’s
requirements and understand the trade-offs when comparing available services
It is common to use multiple stateful services within a single solution
Stateful service considerations
27. The most frequently-used stateful service
Used for documents, images, audio, video, etc.
Fast & inexpensive: pay per GB/month of storage, I/O requests, and egress bytes
All cloud providers offer a file storage service
No lock-in: It’s relatively easy to move files across providers if you avoid
provider-specific features
File storage services offer public (read-only) access
Send clients file URLs for them to access; reduces load on your other services!
Use a Content Delivery Network (CDN) to improve performance even more
Files (blobs & objects) storage services
28. Store many small related entities
Common: query, joins, indexing, sorting, stored proc, viewers/editors, etc.
As data increases, relational DBs (SQL) require expensive
hardware to address size & performance
ACID goal: give impression that 1 thing at a time is happening no matter how
complex the work (looks like a single PC)
NonRel-DBs (noSQL) spread data across many cheap PCs
For customer preferences, shopping carts, product catalogs, session state, etc.
Con: Can’t easily access all data (no sort/join); many are eventually consistent
Pro: Cheaper & have flexible data models (entity ≈ in-memory object)
Rel-DBs & NonRel-DBs will co-exist for years to come
DB storage services
29. Relational DB vs non-relational DB: speed, size, simplicity, & price
[Diagram: Services #1-#5 all sharing one Relational Database (1 partition) offering complex CRUD, joins, sorts, stored procs, & cross-table transactions vs Services #1-#5 using a Non-Relational Database spread across Partitions #1-#3 offering simple CRUD]
30. Data is partitioned for size, speed, or both
Architecting a service’s partitions is often the hardest part of designing a service
Cross-partition ops require network hops & distributed transactions
How many partitions depends on how much data you’ll have in the future
And how you intend to access that data
Each partition’s data is replicated for reliability
Replicating state increases chance of data surviving 1+ simultaneous failures
But, more replicas increase cost & network latency to sync replicas
For some scenarios, data loss is OK
Replicas go across fault/update domains; avoids single point of failure
Data partitioning & replicas
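One common partitioning scheme is hashing the entity key to a fixed partition count; a sketch (the count of 8 is an arbitrary example):

```python
import hashlib

# Sketch: hash-based partitioning. The partition choice must be deterministic
# across processes & machines, so use a stable hash (Python's built-in hash()
# is seeded per process). The partition count is fixed up front, sized for
# future data, because repartitioning later is expensive.
NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
```

Each partition would then be replicated across fault/update domains, per the points above.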
31. The CAP theorem states that, when facing a network Partition (replicas can't talk to each other):
You can maintain Consistency by not allowing writes (loss of availability)
You can maintain Availability by not replicating data (loss of consistency)
Strong: all replicas see the same data at the same time
Done via distributed transactions/locks, which require cross-replica communication
Weak: replicas see different data at a moment in time but
eventually see the same data
There are many factors pushing us towards weak consistency
Transactions rarely work across DBs & each microservice selects its own DB
Caches improve perf by copying data which is out of sync with the truth
CQRS pattern: writes data asynchronously but reads data synchronously
Data consistency
32. A cache can improve performance but introduces stale (inconsistent) data
[Diagram: Load Balancer → Stateless Web → Cache & Stateless Compute → Stateful Data & Other Internal Tiers]
33. Concurrency control
Pessimistic: an accessor locks 1+ entries (blocking other accessors), modifies the entries, & then unlocks them (unblocking other accessors)
Bad scalability (1 accessor at a time) & what if the locker fails to release the lock?
Optimistic: an accessor reads 1+ entries along with their version IDs, then modifies the entries only if the version IDs haven't changed since the read; on a conflict, re-read & retry
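Optimistic concurrency can be sketched with an in-memory stand-in for the store (names and data are hypothetical):

```python
# Sketch: optimistic concurrency with per-entry version IDs. A write succeeds
# only if the entry's version is unchanged since it was read (compare-and-swap);
# a losing writer re-reads & retries instead of holding any lock.
store = {"cart": (1, ["book"])}        # key -> (version, value)

def read_entry(key):
    return store[key]                  # returns (version, value)

def try_write(key, expected_version, new_value):
    version, _ = store[key]
    if version != expected_version:    # someone else wrote first: caller retries
        return False
    store[key] = (version + 1, new_value)
    return True
```

Unlike pessimistic locking, no accessor can block the others, and a crashed accessor leaves nothing to clean up.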
Data schema versioning (without downtime)
Backup & Restore: needed due to app bugs/hackers
Recovery Point Objective (RPO): max data (minutes) the business can afford to lose
Recovery Time Objective (RTO): max downtime the business can afford while restoring data
NOTE: smaller RPO/RTO increases costs
Other DB concerns
Motivation
Embracing failure
When to split a monolith into Microservices and when not to
Containers
Networking
Messaging
Versioning & upgrades
Managing state
Jeffrey Richter: Software Architect, Microsoft Azure
Jeffrey Richter is a Software Engineer on Microsoft’s Azure team. He is also a co-founder of Wintellect, a software consulting and training company. He has authored many videos available on WintellectNOW, has spoken at many industry conferences, and is the author of several best-selling Windows and .NET Framework programming books including Windows Runtime via C#, CLR via C#, 4th Edition, and Windows via C/C++, 5th Edition. Jeffrey has also been a contributing editor to MSDN Magazine where he authored many feature articles and columns.
http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024
http://channel9.msdn.com/Shows/ARCast.TV/ARCastTV-Pat-Helland-on-Memories-Guesses-and-Apologies
Server/Service vs Cloud
Enterprise intranet services: stable demand; security is within the intranet
Cloud: plan for phenomenal growth via scale-up & scale-out; availability is the business
Scale-up: expensive; utilization can be low if the usage pattern varies; geo-aspect (replication); # of instances is lower
Scale-out: economical, cost-effective, larger replication presence
http://xiard.wordpress.com/2008/11/06/pdc-2008-designing-for-scale-out/
http://channel9.msdn.com/pdc2008/BB54/
http://mschnlnine.vo.llnwd.net/d1/pdc08/PPTX/BB54.pptx
Consistency Levels
Strong: Changes visible now via synchronization
Eventual: Changes occur in the future (address change for mail)
Optimistic: Changes occur MAYBE in the future (stock ticker)
Message Assurance
Exactly once: no loss, no dups
At least once: no loss, duplicates
At most once: loss, no duplicates
Best effort: loss & duplicates
Layer 4:
Hash-based traffic distribution (5-tuple: Src IP/Port to Dst IP/Port, protocol)
TCP/UDP support
Port forwarding
Idle timeout adjustment
Client IP affinity (3-tuple: Src IP to Dst IP, protocol); all requests from a client go to same server
TCP & HTTP health monitoring
NAT & SNAT
Layer 7:
Cookie persistence
SSL/TLS Offload
HTTPS monitoring
URL or HTTP path LB
WAF rules
PaaS Scale out
Beware: a new service instance could be assigned a previous instance’s endpoint
This requires certificates or some ID/uniqueness so the client knows which service it's communicating with
MS OneAPI document? https://github.com/Microsoft/api-guidelines/blob/master/Guidelines.md
JMR MOVE: A method that could return an unbounded collection must implement paging & may offer filtering/sorting
Use simple data types and shallow object graphs
JSON: null, true/false, number, string, array (ordered sequence of values), object (unordered set of name/value pairs)
For richer values (guid, date, time, duration), use string & clearly document format
Encourages scalable, resilient, versioning patterns (ex: CQRS & Event Sourcing)
JMR: Election of a single role instance to perform a task