This is a talk I gave at Craft Conf 2023 on the trade-offs between orchestration and choreography as interaction patterns for microservices.
It covers a lot of ground, introducing concepts around software evolution, complexity, coupling and organisation design, and looks at how the decisions we make (either deliberately or implicitly) when designing systems can lead to long-term success or suffering.
6. Hi 👋, I’m Ian
I’m a software engineer from the UK, currently building work products at Meta as part of Reality Labs.
Previously, I was VP of Web Architecture for Genesis and Chief Digital Technology Architect for PokerStars.
www.ian-thomas.net | @anatomic | linkedin.com/in/anatomic
7. On The Menu Today…
● The need for change and how our decisions impact it
● How complexity and coupling are conspiring to slow your progress
● Orchestration and choreography patterns
● State, data, failure and humans
● Tooling and processes to help smooth the road
I encourage you to keep in mind five “C”s:
Communication, Consistency, Coordination, Coupling and Complexity
10. Continuing Change: Requires continual adaptation or it becomes progressively less satisfactory
Increasing Complexity: Complexity increases unless work is done to maintain or reduce it
Continuing Growth: Functional content must be continually increased to maintain user satisfaction
Declining Quality: Quality will appear to be declining unless a system is rigorously maintained and adapted to operational environment changes
Self Regulation
Conservation of Familiarity
Conservation of Organisational Stability
Feedback System
13. Continuing Change
Self Regulation
Conservation of Familiarity
Declining Quality
Increasing Complexity
Conservation of Organisational Stability
Continuing Growth
Feedback System
Change in the world outside systems drives the need for growth and change within
Change in the system itself is self-limiting unless deliberate effort is exerted
14. “[..] shows the continuing growth of the system
(first law) albeit at a declining rate
(demonstrably due to increasing difficulty of
change, growing complexity (second law))”
M. Lehman - Programs, Life Cycles, and Laws of Software Evolution
36. Handling Failure
● Circuit Breakers
● Timeouts*
● Service Discovery
● Retries
● Healthchecks
● Auto-scaling
● Bulkheads
● Mutual TLS
Many of these requirements can be pushed to the platform
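To make two of the items above concrete, here is a minimal sketch of a circuit breaker and a retry helper with exponential backoff. It is an illustration under stated assumptions rather than anything shown in the talk: the names `CircuitBreaker`, `CircuitOpenError` and `call_with_retries` are hypothetical, and in practice a service mesh or platform library would usually provide this behaviour.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry `fn` with exponential backoff; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The two helpers compose: wrapping `breaker.call(...)` inside `call_with_retries` retries transient errors but stops immediately once the breaker is open.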
38. Orchestration
Pro
● Single controller managing workflow state
● Complex error handling is easier to manage
● Platform tooling increasingly removing complexity from applications (especially for synchronous calls)
● Recoverability
● Lower cognitive load
● Version controllable workflow definitions
● Lots of tooling to support o11y of API driven services
Con
● Single point of failure
● Additional latency
● Scalability
● Responsiveness
● Coupling between orchestrator and services
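As a rough illustration of the "single controller managing workflow state" idea, the sketch below runs a list of steps in order and, on failure, compensates the completed steps in reverse. Everything here is hypothetical (the `Step` and `Orchestrator` names, the order workflow and its fields); it is not taken from the talk or from any particular orchestration product.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]      # does the work
    compensate: Callable[[dict], None]  # undoes the work if a later step fails


@dataclass
class Orchestrator:
    steps: list
    log: list = field(default_factory=list)  # the single view of workflow state

    def run(self, ctx: dict) -> bool:
        done = []
        for step in self.steps:
            try:
                step.action(ctx)
                self.log.append(f"{step.name}: ok")
                done.append(step)
            except Exception as exc:
                self.log.append(f"{step.name}: failed ({exc})")
                # Central error handling: undo completed steps in reverse order.
                for prev in reversed(done):
                    prev.compensate(ctx)
                    self.log.append(f"{prev.name}: compensated")
                return False
        return True


# Hypothetical order workflow used only for illustration.
def reserve_stock(ctx):
    ctx["stock_reserved"] = True

def release_stock(ctx):
    ctx["stock_reserved"] = False

def take_payment(ctx):
    if ctx["amount"] > ctx["credit"]:
        raise ValueError("insufficient credit")
    ctx["paid"] = True

def refund_payment(ctx):
    ctx["paid"] = False

ORDER_WORKFLOW = [
    Step("reserve_stock", reserve_stock, release_stock),
    Step("take_payment", take_payment, refund_payment),
]
```

Note that the workflow definition is plain data that can live in version control, while the orchestrator itself becomes the component every step depends on.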
44. Types of Event
Event Notification: Announcing facts, with no expectation of action or response
Event-Carried State Transfer: Reduce chattiness between services by including data in the event
Event Sourcing: Events are recorded in a persistent log, allowing for replay and state reconstruction
CQRS: Separate reading and writing, handles broad variation in access patterns
Martin Fowler - What do you mean by “Event-Driven”?
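The event sourcing entry is perhaps the easiest to show in code. The sketch below keeps an append-only list of events and rebuilds state by replaying it. The account example, the event names ("deposited", "withdrawn") and the function names are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event names: "deposited" or "withdrawn"
    amount: int


def apply_event(balance: int, event: Event) -> int:
    """Return the new state after applying one event to the old state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    return balance  # unknown event kinds are ignored in this sketch


def replay(log, initial=0):
    """Reconstruct state by folding every recorded event over the initial state."""
    state = initial
    for event in log:
        state = apply_event(state, event)
    return state


# The log is the source of truth; current state is derived from it.
LOG = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
```

Replaying a prefix of the log, for example `replay(LOG[:2])`, gives the state as it was at that point in history.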
53. Choreography
Pro
● Weak coupling between services
● Scalability
● Responsiveness
● Fault tolerance, no single point of failure
● High throughput
Con
● Complexity grows rapidly with event cardinality
● Typically requires intermediate infrastructure
● Versioning of events
● No single view of workflow state
● Hard to version control workflow
● Error handling (especially at the workflow level)
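For contrast with a central controller, here is a minimal choreography sketch: services only subscribe to and publish events, and the workflow exists implicitly in the subscriptions. The `EventBus` class, the handler names and the event names are all hypothetical, and delivery is synchronous and in-process, whereas a real system would typically sit on a broker.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-memory stand-in for a message broker."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.published = []  # kept only so the flow can be inspected

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.published.append(event_type)
        for handler in list(self._handlers[event_type]):
            handler(payload)


def build_bus():
    bus = EventBus()

    # Each "service" reacts to one event and announces the next fact.
    def inventory_service(order):
        bus.publish("stock_reserved", order)

    def payment_service(order):
        bus.publish("payment_taken", order)

    def order_service(order):
        order["status"] = "confirmed"

    bus.subscribe("order_placed", inventory_service)
    bus.subscribe("stock_reserved", payment_service)
    bus.subscribe("payment_taken", order_service)
    return bus
```

No component here knows the whole flow, which is the weak coupling listed under Pro and also the "no single view of workflow state" listed under Con.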
55. …use orchestration within the bounded
context of a microservice, but use
choreography between bounded-contexts.
Yan Cui – Choreography vs Orchestration in the land of serverless
58. How formally do you need to specify your workflow?
From formally specified (orchestration) to informally specified (choreography):
● Orchestration service, workflow defined using DSL/programming language (declarative)
● Custom orchestrator, workflow defined in general purpose programming language (imperative)
● Front controller knows about workflow
● Stateless, hopefully documented, potentially just in a few people’s heads
59. An architect can never reduce semantic
coupling via implementation, but they
can make it worse.
Neal Ford, Mark Richards, Pramod Sadalage & Zhamak Dehghani – Software Architecture: The Hard Parts
61. Types of Coupling
Operational: A consumer can’t run without a provider
Developmental: Changes in producers and consumers must be coordinated
Semantic: Change together because of shared concepts
Functional: Change together because of shared responsibility
Incidental: Change together for no good reason
Michael Nygard - Uncoupling
Temporal effects?
74. Types of Coupling
Operational: A consumer can’t run without a provider
Developmental: Changes in producers and consumers must be coordinated
Semantic: Change together because of shared concepts
Functional: Change together because of shared responsibility
Incidental: Change together for no good reason
Michael Nygard - Uncoupling
Organisational: Progress can only be achieved through others
76. The value of orchestration increases with workflow
complexity, notably with complex error scenarios.
Conversely, responsiveness and scalability
requirements favour choreography, especially when
error handling is minimal.
77. @anatomic’s rough guide to Orchestration vs Choreography
Characteristic | Orchestration | Choreography
Operational Coupling | Strong | Very weak
Developmental Coupling | Strong | Weaker, caution advised
Semantic Coupling | Less Strong | Weak
Functional Coupling | Less Strong | Weak
Incidental Coupling | Less likely, potentially easier to find | Harder to find, more sinister when present
Scalability | Scale cascades, less suitable for parallelism | Backpressure to decouple, easier to parallelise
Reliability | Only as good as your weakest link | Careful design decouples uptime
Responsiveness | Central bottleneck, processing chains add latency | Highly responsive due to reduced operational coupling
Fault Tolerance | Low, due to single point of failure effect | Excellent
Error Handling | Eased through central state management | Harder, risk of event explosion and passive/aggressive
Cognitive Load | Lower | Complexity grows with number of events
Observability | Traditional tooling and central state make o11y easier | Potentially more difficult, requires strong platform support
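The "backpressure to decouple" entry in the scalability row can be sketched with a bounded buffer between a producer and a consumer: when the buffer is full the producer is told to back off instead of overwhelming the consumer. The helper names `offer` and `drain` are hypothetical, and a real broker would offer its own flow-control mechanisms.

```python
import queue


def offer(buffer, item):
    """Try to enqueue without blocking; False signals the producer to back off."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False


def drain(buffer, handler, limit):
    """Consume up to `limit` items at the consumer's own pace; return the count."""
    handled = 0
    while handled < limit:
        try:
            item = buffer.get_nowait()
        except queue.Empty:
            break
        handler(item)
        handled += 1
    return handled
```

With `queue.Queue(maxsize=2)`, a third `offer` is rejected until the consumer has drained at least one item, so the producer's rate is bounded by the consumer's without either calling the other directly.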
81. Four pillars of Event Streaming Capabilities
Business Function: Actually doing the work we need, the business function (or “core” plane) is where the value lies for our customers and the business.
Instrumentation: The metrics and telemetry necessary for us to determine if the system is working as expected.
Control Plane: Systems will keep on chugging, even when we might need them to stop. Control planes help manage change, including pausing, scaling and rate-limiting.
Operational Plane: Tooling and processes to help run our systems, including addressing failure modes (wiping data and corrective actions), upgrade processes and evolutionary support.
https://www.confluent.io/en-gb/blog/journey-to-event-driven-part-4-four-pillars-of-event-streaming-microservices/
More difficult to implement in event-driven systems
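As a small illustration of what a control plane asks of an event consumer, the sketch below adds pause, resume and a crude per-tick rate limit to a consumer, plus a counter standing in for instrumentation. The `ControlledConsumer` class and its method names are hypothetical and not drawn from the Confluent article.

```python
class ControlledConsumer:
    """Hypothetical consumer exposing simple control-plane hooks."""

    def __init__(self, handler, max_per_tick=10):
        self.handler = handler
        self.max_per_tick = max_per_tick  # crude rate limit per processing tick
        self.paused = False
        self.processed = 0  # instrumentation: total events handled

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def tick(self, backlog):
        """Handle up to `max_per_tick` events from `backlog` (a list)."""
        if self.paused:
            return 0  # events stay in the backlog until resumed
        batch = backlog[: self.max_per_tick]
        del backlog[: len(batch)]
        for event in batch:
            self.handler(event)
        self.processed += len(batch)
        return len(batch)
```

Pausing leaves events untouched in the backlog, so processing can pick up where it left off once `resume()` is called.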
84. What’s in a line?
[Diagram: A and B joined by a line]
Inter-process communication
Traffic flow
DNS
Service discovery
Routing
Schemas
Certificates
Firewall
Physical connection
Load balancing
AuthN/AuthZ
Secrets
Data format
Protocol
Failure modes
86. What’s in a line?
[Diagram: A and B joined by a line]
Inter-process communication
Traffic flow
DNS
Service discovery
Routing
Schemas
Certificates
Firewall
Physical connection
Load balancing
AuthN/AuthZ
Secrets
Data format
Protocol
Failure modes
Different priorities
Required changes
Change management
ITIL
Change Advisory Boards
Time zones
Backlogs
Scrum of Scrums
Language
88. What’s in a line?
[Diagram: A and B joined by a line]
Inter-process communication
Traffic flow
DNS
Service discovery
Routing
Schemas
Certificates
Firewall
Physical connection
Load balancing
AuthN/AuthZ
Secrets
Data format
Protocol
Failure modes
Different priorities
Required changes
Change management
ITIL
Change Advisory Boards
Time zones
Backlogs
Scrum of Scrums
Language
How are you going to handle…
● Testing
● Schema evolution
● Fallacies of distributed computing
● Serialisation formats
● Accidental coupling
● Long-running workflows
● Ownership of workflow state
● Team geo-distribution
● Distributed tracing
● Self-service infrastructure
And all the other stuff that won’t fit on a slide
90. Schema Evolution
Operational | Changes Allowed | Schemas Validated | Upgrade First
Backward | Delete fields; add optional fields | Last version | Consumers
Backward transitive | Delete fields; add optional fields | All previous versions | Consumers
Forward | Add fields; delete optional fields | Last version | Producers
Forward transitive | Add fields; delete optional fields | All previous versions | Producers
Full | Add optional fields; delete optional fields | Last version | Any order
Full transitive | Add optional fields; delete optional fields | All previous versions | Any order
None | All changes accepted | None | Depends
https://docs.confluent.io/platform/current/schema-registry/avro.html#compatibility-types
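The rules in the table can be approximated with a few set operations. The sketch below is a deliberately simplified model, not the Schema Registry's implementation: a schema is a plain dict mapping field names to `{"optional": bool}` (a made-up representation), type changes are ignored, and only two versions are compared, so it covers the non-transitive modes.

```python
def is_backward_compatible(old, new):
    """Readers on `new` can read data written with `old`.

    Deleting fields is fine; any added field must be optional.
    """
    added = set(new) - set(old)
    return all(new[name]["optional"] for name in added)


def is_forward_compatible(old, new):
    """Readers still on `old` can read data written with `new`.

    Adding fields is fine; any deleted field must have been optional.
    """
    removed = set(old) - set(new)
    return all(old[name]["optional"] for name in removed)


def is_fully_compatible(old, new):
    """Both directions hold, so producers and consumers can upgrade in any order."""
    return is_backward_compatible(old, new) and is_forward_compatible(old, new)


# Hypothetical schema versions for illustration.
V1 = {"id": {"optional": False}, "note": {"optional": True}}
V2_ADD_OPTIONAL = {**V1, "email": {"optional": True}}
V2_ADD_REQUIRED = {**V1, "email": {"optional": False}}
V2_DROP_REQUIRED = {"note": {"optional": True}}
```

The "Upgrade First" column follows from the direction of the check: backward-compatible changes are safe once consumers move first, and forward-compatible changes once producers move first.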
93. Favour orchestration for complex workflows,
choreography for scalability + weaker coupling
Enable long-term changeability through
deliberate design + trade-off analysis
Complexity breeds in the bits between our systems,
handle with care (+ don’t forget about the humans!)
94. Thanks 🖖
If you’re interested in chatting more about any
of the topics covered in this talk, come and grab
me in the hallway track or virtually through
Twitter or LinkedIn.
Thank you for coming to hear me speak!
www.ian-thomas.net | @anatomic | linkedin.com/in/anatomic
2023