2. 2
Traditional vs. Modern application
Traditional on-premises | Modern cloud
Relational database | Polyglot persistence
Strong consistency | Eventual consistency
Design for predictable scalability | Design for unbounded scalability
Serial and synchronized processing | Parallel and asynchronous processing
Monolithic, centralized | Decomposed, de-centralized
Snowflake servers | Immutable infrastructure
Integrated authentication | Federated authentication
Design to keep app running (MTBF) | Design for failure (MTTR)
Big bang release | Frequent small updates
Manual management | Automated self-management
3. 3
Process of Software development
1. Analyze business domain
2. Design services around business domain
3. Implement services on platform
Business alignment -> Technology alignment
4. 4
Process of Software development
Steps: Functional & Non-functional requirements -> Choose architecture style -> Choose technology -> Apply design patterns & best practices
Analysis approaches: Business capability, Domain driven, Data driven
Guidance: Architecture style guide, Design principles, Compute selection guide, Storage selection guide, Best practices, Design patterns, Design review checklist
End to end traceability
6. 6
Analyzing business domain
• Functional – Domain driven or data centric
  • Ubiquitous language to model business domains
  • Bounded context shows service boundary
  • Context map visualizes service dependencies
  • Aggregates and domain services/events lead to microservices and inter-service communication
• Non-functional – RTO/RPO/MTO, SLO/SLA, Security, Operation
  • RTO leads to failover period
  • RPO leads to backup interval
  • SLA leads to choice of services w/ level of redundancy
  • Throughput/Latency leads to choice of SKU w/ partitioning and topology
  • Security leads to authN/authZ, encryption (in transit, at rest)
  • Operation leads to automated monitoring, management solutions
7. 7
What if we don’t model around business?
• Spending too much time on the features nobody’s going to use
• Wrong assumptions are baked into data and service model
• Change in technology directly affects service model
• System is neither scalable nor secure
• SLA target is not met
• System health is not visible to operators
12. 12
Context map – Drone delivery service
[Diagram: domains and bounded contexts: Shipping (Core), Accounts, Drone management, Drone sharing, 3rd party transportation, Call center, Video surveillance]
13. 13
Aggregates and services in shipping domain
[Diagram: Shipping contains Drone, Package, Delivery, DeliveryScheduler, and DeliverySupervisor, and relates to Account, 3rd party transportation, and Authentication. Legend: aggregate, domain service.]
14. 14
Non-functional requirements – Shipping
• 10K to 100K rps to delivery scheduler
• 1M write/sec to geospatial index (geo-data / 4 sec / drone)
• 99th-percentile latency has to be within 1 sec
• 99.99% uptime for delivery scheduler
• Daily update for ETA/Drone optimization algorithm
• CI/CD to support daily deployment
• Recovery time objective (RTO) should be 10 mins
• Recovery point objective (RPO) should be an hour
• Drones and PII need to be protected from malicious attack
• Monitor the system for RCA and system health
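These targets translate into concrete budgets. A minimal sketch of the arithmetic, assuming that "geo-data / 4 sec / drone" means each drone writes one location update every 4 seconds and that uptime is measured over a 365-day year. Both are interpretations, not stated on the slide, and the function names are illustrative only.

```python
# Back-of-the-envelope numbers derived from the requirements above.
# Assumptions (not from the slide): 365-day year, one write per drone per interval.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Allowed downtime for an availability target over a period."""
    return (1.0 - availability) * period_minutes


def implied_fleet_size(writes_per_sec: int, seconds_between_updates: int) -> int:
    """Number of drones implied if each drone writes once per interval."""
    return writes_per_sec * seconds_between_updates


def backup_interval_ok(backup_interval_minutes: int, rpo_minutes: int) -> bool:
    """A backup interval longer than the RPO could lose more data than allowed."""
    return backup_interval_minutes <= rpo_minutes
```

Under these assumptions, 99.99% uptime leaves roughly 52.6 minutes of downtime per year, 1M writes/sec corresponds to about 4 million drones, and a one-hour RPO means backups (or replication checkpoints) at least hourly.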
19. 19
Constraints in microservices
- A service represents a single responsibility
- Services don’t share their data
- Service calls only via API
- Every service has to be independently deployable
- Release pipeline has to be de-centralized
21. 21
Choosing architecture style
• Business domain (Functional, Non-functional)
• Type/Complexity of domain
• Prerequisites
• Skillset, Team, Culture
• Benefits vs. Challenges
• Do the benefits justify taking on the challenges?
• Degree of conformity
• Purist vs. Pragmatist
22. 22
When to choose Microservices?
Benefits
- Independent deployment
- Fault isolation
- Diverse technology
- Small focused team
- Separate scalability
Challenges
- Complexity
- Network congestion
- Data integrity/consistency
- Testing
- Reliability
Business domain
- Complex domain
- Deployment at high velocity
- Many independent teams
Prerequisites
- Skill set for distributed system
- Domain knowledge
- DevOps culture
- Monitoring capability
24. 24
Choosing architecture styles
Style | Dependency management | Domain type/complexity
N-Tier+ | Horizontal layers (open/close) | Traditional business domain; frequency of update is low
Web-Queue-Worker | Front/Backend jobs decoupled by async messaging | Relatively simple domain with some resource intensive tasks
Microservices | Vertical (functional) decoupling; service calls via API | Complicated domain; frequent update is required
CQRS | R/W segregation; schema/scale are optimized separately | Collaborative domain where lots of users access the same data
EDA | Data ingested into streaming; independent view per sub-system | Internet of things
Big data | Divide huge dataset into small chunks; parallel processing on local dataset | Batch and real-time data analysis; predictive analysis using ML
Big compute | Data allocation to thousands of cores; embarrassingly parallel processing | Compute intensive domain such as simulation, number crunching
26. 26
Cloud application architecture
[Diagram: three areas: Web/Mobile application, IIoT, Data analysis. Components: User, User management, API Gateway, Business logic, Polyglot Storage, Notification, Remote Service, Device, Device Gateway, Device control, Device State & Mgmt, Business System, Streaming & analytics, Batch Analysis, Near Real-time Analysis, Hot & Cold Storage, Serving & BI. Architecture style labels: SPA, Microservices, N-Tier, Web-Queue-Worker, CQRS, Event Driven, Big data, Big compute, Lift & Shift.]
30. 30
Best practices for N-Tier+ architecture
[Diagram components: User, Admin, NVA, Jump Box, Web tier, Middle tier 1, Middle tier 2, Messaging, Cache, Database, Storage, Remote service]
• Place multiple NVAs for HA
• Use messaging to decouple tiers
• Cache semi-static data
• Protect internet access by NVA
• Restrict access to data tier
• Admin tasks via jump box
• Use separate subnet/availability set per tier with multiple VMs
• Configure redundancy such as SQL AlwaysOn AG
31. 31
N-Tier+
• When to use:
• Migration scenario with minimum refactoring from existing app
• Simple web applications (e.g. admin web site)
• You need unified development/management across on-premises and cloud
• Benefits:
• High portability
• Shallower learning curve
• Natural evolution from traditional model
• Open to heterogeneous environment (Windows/Linux)
• Challenges:
• Monolith prevents independent deployment
• Manageability is not optimal
• Versioning of each service running on VMs
• Configuring network security is not trivial
• Conforming to industry regulations (e.g. PCI, SOX, HIPAA)
34. 34
Best practices for Web-Queue-Worker
[Diagram components: SPA & Mobile, CDN, IdP, Web frontend services, Messaging, Workers, Cache, SQL, NoSQL, Remote service]
• Partition data
• Use polyglot storage
• Auto-scale instances
• Decouple resource intensive jobs
• Host static content
• Expose consumer friendly API
• Retry transient faults
• Cache semi-static data
37. 37
Web-Queue-Worker
• When to use
• Web applications with straightforward business logic
• You want to take advantage of managed services
• Benefits:
• Very first Azure architecture
• Relatively simple architecture that is easy to understand
• Easy to deploy and manage
• Clear separation of concerns
• The front end is decoupled from the worker using asynchronous messaging
• Challenges
• Without careful design, the web front end and the worker can become large, monolithic components that are difficult to maintain and update.
• There may be hidden dependencies, if the front end and worker share data schemas or code modules.
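The decoupling of front end and worker through asynchronous messaging can be sketched in a few lines. This is a minimal in-process illustration, assuming Python's standard `queue` and `threading` modules stand in for a real message queue and worker host; `handle_request`, `worker`, and `run_demo` are hypothetical names.

```python
import queue
import threading

# In-process stand-in for a message queue between front end and worker.
jobs: "queue.Queue" = queue.Queue()
results = {}


def worker() -> None:
    """Worker: pulls jobs off the queue until it sees the None sentinel."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        job_id, payload = job
        results[job_id] = sum(payload)  # stand-in for a resource intensive task
        jobs.task_done()


def handle_request(job_id: str, payload: list) -> dict:
    """Web front end: enqueue the job and return without waiting for it."""
    jobs.put((job_id, payload))
    return {"status": "accepted", "job_id": job_id}


def run_demo():
    """Start one worker, submit one job, then shut the worker down."""
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    response = handle_request("job-1", [1, 2, 3])
    jobs.put(None)  # sentinel: tell the worker to stop
    jobs.join()     # wait until every queued item has been processed
    return response, dict(results)
```

The front end only knows the message shape, not the worker's code. That shared message shape is exactly the kind of hidden dependency the last bullet warns about.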
40. 40
Microservices – Best practices
[Diagram components: SPA & Mobile, API Gateway, IdP, Microservices, SQL, NoSQL, Remote Service, DevOps, Release process (one per service)]
• Each service has a single responsibility
• Don’t share the data directly
• Every request goes through GW
• Model service around business domain
• Isolate failure
• Decentralize all things
• Don’t leak implementation details
• Service calls via API
• Offload cross cutting concerns to GW
• Use polyglot storage
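Three of these practices (every request goes through the gateway, services are called only via their API, and data is not shared directly) can be illustrated with a small sketch. The class names, route prefix, token set, and response shape below are hypothetical, chosen only for illustration; a real gateway would validate tokens against an identity provider.

```python
from typing import Callable, Dict, Optional


class DeliveryService:
    """Owns its data; callers reach it only through its methods (its API)."""

    def __init__(self) -> None:
        self._deliveries: Dict[str, str] = {"d1": "in-flight"}  # private store

    def get_status(self, delivery_id: str) -> dict:
        status = self._deliveries.get(delivery_id)
        if status is None:
            return {"code": 404, "body": "unknown delivery"}
        return {"code": 200, "body": status}


class Gateway:
    """Single entry point: checks the token, then routes to a service."""

    def __init__(self, valid_tokens: set) -> None:
        self._valid_tokens = valid_tokens
        self._routes: Dict[str, Callable[[str], dict]] = {}

    def register(self, prefix: str, handler: Callable[[str], dict]) -> None:
        self._routes[prefix] = handler

    def handle(self, token: Optional[str], path: str) -> dict:
        # Cross-cutting concern handled once, at the gateway.
        if token not in self._valid_tokens:
            return {"code": 401, "body": "invalid token"}
        prefix, _, resource_id = path.strip("/").partition("/")
        handler = self._routes.get(prefix)
        if handler is None:
            return {"code": 404, "body": "no such route"}
        return handler(resource_id)
```

For example, after `gw.register("deliveries", DeliveryService().get_status)`, a call to `gw.handle(token, "/deliveries/d1")` is routed to the service without the caller ever touching its data store.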
41. 41
Microservices
• When to use
• Requires continuous innovation
• Requires deployment at high velocity
• Deals with complex domain
• Benefits
• Independent deployment
• Fault isolation
• Diverse technology
• Small focused team
• Separate scalability
• Challenges
• Complexity
• Network congestion
• Data integrity/consistency
• Testing
• Reliability
46. 46
CQRS
• When to use
• Collaborative domain with lots of operations to the same data
• R/W mismatch causes issues
• High scalability is required
• Benefits
• Separate scalability for R/W
• Decoupling Read from Write
• Optimal schema for read and write
• Challenges
• Data consistency issues
• Complex implementation
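A minimal sketch of the read/write segregation, assuming an in-memory event list connects the two sides. `DeliveryScheduled`, `WriteModel`, and `ReadModel` are illustrative names, not taken from the deck. The read side is only as fresh as its last `sync`, which is where the data consistency challenge comes from.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DeliveryScheduled:
    delivery_id: str
    customer: str


class WriteModel:
    """Command side: validates commands and records the resulting events."""

    def __init__(self) -> None:
        self.events: List[DeliveryScheduled] = []
        self._known_ids: set = set()

    def schedule_delivery(self, delivery_id: str, customer: str) -> DeliveryScheduled:
        if delivery_id in self._known_ids:
            raise ValueError("duplicate delivery id")
        self._known_ids.add(delivery_id)
        event = DeliveryScheduled(delivery_id, customer)
        self.events.append(event)
        return event


class ReadModel:
    """Query side: a denormalized view shaped for one query (by customer)."""

    def __init__(self) -> None:
        self._by_customer: Dict[str, List[str]] = {}
        self._applied = 0  # number of events already folded into the view

    def sync(self, events: List[DeliveryScheduled]) -> None:
        for event in events[self._applied:]:
            self._by_customer.setdefault(event.customer, []).append(event.delivery_id)
        self._applied = len(events)

    def deliveries_for(self, customer: str) -> List[str]:
        return list(self._by_customer.get(customer, []))
```

Because the two sides share only events, each can pick its own schema and scale independently, at the cost of a window in which queries return stale results.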
49. 49
Event Driven
• When to use
• Multiple subsystems process the same event
• Real-time processing with minimum time lag
• Complex event processing such as pattern matching
• Event processing with high ingestion rate such as IoT
• Benefits
• No point-to-point integrations
• Immediate actions at consumer (minimum time lag)
• Producers are well decoupled from consumers
• Highly scalable and distributable
• Challenges
• Reliability: losing a single event could make the system unstable (guaranteed delivery is needed)
• Order of processing
• Exactly-once processing
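A minimal sketch of multiple subsystems consuming the same event, assuming an in-process bus in place of a real streaming service. Since many brokers redeliver events (at-least-once delivery), one common way to approximate exactly-once effects is to make consumers idempotent by remembering event ids. All names and the topic string are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]


class EventBus:
    """In-process publish/subscribe: producers never reference consumers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Event], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Event) -> None:
        for handler in self._subscribers[topic]:
            handler(event)


class IdempotentConsumer:
    """Ignores events it has already seen, so redelivery has no extra effect."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.processed: List[str] = []
        self._seen: set = set()

    def handle(self, event: Event) -> None:
        if event["id"] in self._seen:
            return  # duplicate delivery: skip
        self._seen.add(event["id"])
        self.processed.append(event["id"])
```

Publishing the same event twice to two subscribed consumers leaves each with exactly one processed entry. Ordering and durable delivery are not addressed by this sketch.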
55. 55
Big data reference architecture
[Diagram components: Data source, Data storage, Data streaming, Batch processing, Stream analysis, Serving layer, Business intelligence, Orchestration]
57. 57
Best practices for Big data
[Diagram components: Data source, Data storage, Data streaming, Batch processing, Stream analysis, Serving layer, Business intelligence, Orchestration]
• Implement retention policy
• Upload large datasets using multiple threads in parallel
• Scrub sensitive data before publishing
• Partition the data
• Automate data ingestion and processing by orchestration tools
• Provision a separate cluster for HBase/Storm from the one used for batch processing
• Prevent data skew issues
• Use protocol conversion to speed up data transfer
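Two of these practices, partitioning the data and uploading in parallel, can be sketched together. This is an illustration only: it assumes per-day partitions with a `date=YYYY-MM-DD` path convention and uses a hypothetical `InMemoryStore` in place of a real storage client.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone
from typing import Dict, List


def day_partition(timestamp: float) -> str:
    """Map a Unix timestamp to a per-day partition path (UTC)."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return day.strftime("date=%Y-%m-%d")


def group_by_day(records: List[dict]) -> Dict[str, List[dict]]:
    """Group records by the partition their 'ts' field falls into."""
    groups = defaultdict(list)
    for record in records:
        groups[day_partition(record["ts"])].append(record)
    return dict(groups)


class InMemoryStore:
    """Hypothetical stand-in for a storage client; upload() is thread-safe."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.blobs: Dict[str, List[dict]] = {}

    def upload(self, path: str, rows: List[dict]) -> None:
        with self._lock:
            self.blobs[path] = list(rows)


def upload_in_parallel(store: InMemoryStore, groups: Dict[str, List[dict]],
                       workers: int = 4) -> None:
    """Upload each partition on its own thread and surface any errors."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(store.upload, path, rows)
                   for path, rows in groups.items()]
        for future in futures:
            future.result()  # re-raises an exception from the worker thread
```

Per-day partitions also make a retention policy simple to apply, since expiring old data becomes a matter of deleting whole partitions.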
58. 58
Big data – service mapping
Component | Services
Data source | Device, Weblogs, Click stream, OLTP
Data storage | Azure Data Lake Store, Azure storage
Data streaming | EventHub, IoT Hub, Kafka
Stream analysis | ASA, Spark, Storm
Batch processing | ADLA, HDInsight
Serving layer | HBase, Cassandra, DocumentDB, SQL DB/DWH, Spark
Business intelligence | Power BI, Excel, SSAS, Tableau, Qlikview, Custom app
Orchestration | Azure Data Factory, Oozie, SSIS
59. 59
Big data
• When to use
• Process TB ~ PB of data in a timely manner
• Pre-process raw data and pass the aggregated results to BI
• Real-time processing
• Experiment with new data types quickly
• Predictive analysis
• Benefits
• Cost effective solution for large dataset
• High performance by parallel processing with data locality
• Challenges
• Data ingestion
• Numerous combinations of technologies
• Too many knobs to optimize performance
• Security
61. 61
Context map – Drone delivery service
[Diagram: domains and bounded contexts: Shipping (Core), Accounts, Drone management, Drone sharing, 3rd party transportation, Call center, Video surveillance]
62. 62
Aggregates and services in shipping domain
[Diagram: the shipping-domain elements (Drone, Package, Delivery, DeliveryScheduler, DeliverySupervisor, Account, 3rd party transportation, Authentication) mapped to microservices; some of the microservices are in a different BC.]
65. 65
Drone delivery application architecture
[Diagram: the cloud application architecture (Web/Mobile application, IIoT, Data analysis) annotated with drone delivery functions: Drone geolocation and ETA, User device geolocation, Account management, Drone scheduling, ETA, Drone placement]
I’m trying to compare the common characteristics of each
These common characteristics raise questions that you need to answer.
How to choose the right storage? (Polyglot cheat sheet)
How to deal with eventual consistency issues? (Data consistency primer)
How to make apps scalable? (Auto-scaling guidance)
How to control concurrent access? (Concurrent access guidance, WIP)
How to decompose a monolith to distributed components? (Data/Compute partitioning guidance)
How to make apps immutable?
How to choose the right authentication model? (Identity guidance)
How to design multi-tenant apps? (Multi-tenant guidance)
How to deal with transient/non-transient faults? (Retry guidance)
https://dzone.com/articles/martin-fowler-snowflake
Even as the industry shifts toward new business models,
we still need to align business with technology in 3 steps:
Analyze the business domain to capture its requirements
Design services to realize the requirements
Implement the services on a technology platform
These are the 3 steps at a very high level.
You can say industry 4.0 can be implemented by IoT-Hub and Service fabric
but there’s a huge gap between business and technology.
We need to fill in that gap. How can we translate business into technology?
Not to mention this is an iterative process
If this is a migration scenario and you already know enough about the domain, that’s fine.
But in a greenfield scenario, working with domain experts to analyze requirements is key to success.
We need a domain expert to work with on business analysis
Price lookup in POS requires 500ms latency
Encrypt sensitive data over the wire or at rest
Can’t emphasize the importance of this step enough
XA transactions across two diff custom storage in large SI
>50% of BI solutions are not being used in largest retail franchise
Anemic domain model
UBER baked in wrong assumptions which caused massive refactoring
Separate service per each storage.
Scale and security can’t be afterthoughts
Same with availability and operations
You have to make system observable for operators
Easier said than done so we started our own project to show how to model service around business.
Simply put, this service delivers goods to you in a matter of minutes using drones.
Context map shows the mapping among bounded contexts (BCs) as well as between domains and BCs
Domain represents problem space
BC represents solution space
BC is a linguistic boundary
Ideally 1:1 mapping (esp. greenfield)
Mapping patterns: tightly or loosely coupled
Now we have domain model analyzed. Next step is to choose right architecture
Once you define the business domain, the next step is to choose an architecture style.
Same domain can be implemented by different styles.
Windows-DNA, COM/COM+
RIA, Silverlight
SOA, Web service
Cloud, Azure
Microservices, Containers/SF
If you choose an architecture meant for a small apartment and use it for a skyscraper, it will collapse.
Architecture acts as a mold to organize service design and dependency
Architectural styles as constraints. The high-level concepts in an architectural style impose a set of constraints on the architecture. These constraints guide the design and create a “shape”. The hope is that by conforming to these constraints, certain desirable properties will emerge. Therefore it’s important to understand not just the constraints (the “shape” of the architecture) but the motivation behind them.
You need to follow these rigid rules, otherwise you don’t get the benefits of MSA
Which is better? How do we make that decision?
How can we make decisions?
We should keep these 4 dimensions in mind
- Affinity to a particular Business domain
Prerequisite (You must be this tall to use XXX) If you don’t have enough skillset, don’t choose it
Does Benefit justify taking challenges?
Purist vs. pragmatist. I’d rather be a pragmatist meaning you have to adjust the degree of conformity to the reality
Messaging, concurrency control, eventual consistency
DevOps culture: CI/CD, Automation, Self provisioning/management
Monitoring (Correlation) is critical for RCA
Each service gets simpler, but the complexity moves to the integration part, which is the networking among services
How can you do E2E/integration testing?
More service means more surface area to fail.
Is this the goal you’re aiming for?
Do you meet the prerequisites?
Does benefit justify taking these challenges?
Many services means many points of failure.
Figure out whether MSA is the right choice based on these four dimensions
Monolith and microservices are two extremes in the spectrum
You’ll end up somewhere in between, then continue decomposing further down instead of trying to design perfect microservices from day 1.
They are all based on our customer engagements.
Good ones become patterns / best practices, bad ones become anti-patterns
There’s no point in functional partitioning when you don’t have many functions.
Three main areas in cloud applications. There’s no clear distinction between them.
The details of this diagram don’t matter much. It’s a simplification of each piece.
For example, batch operations such as a daily report can be implemented as a backend service or as part of data analysis.
IIoT and Big data have some overlap.
User mgmt is integrated with AD, CRM etc.
CDN & Notification?
As the name implies, N number of tiers. Most common setting is 3 tier w/ web, middle, and DB. Sometimes more than 1 middle tier.
IaaS + PaaS offerings such as messaging, cache, and storage.
Jumpbox to restrict access from administrator.
Let me show you best practices in this architecture.
Fortinet, F5, Barracuda
HA-NVA from p&p
Isolate each tier by network subnet
Also put VMs per each tier into different availability set so at least one instance per each tier will be up and running
The jump box is the only backdoor, open only to particular client IPs. Whitelist IPs
Logical layers and physical tiers. It’s not necessarily 1:1 mapping. Middle tier is optional.
Layers are the way to manage dependency. Open/Close model.
IaaS + PaaS (Cache, Messaging, DB/Storage are very common)
Web frontend offloads backend jobs to workers via messaging
The rest of the picture looks similar to N-Tier.
Variation to skip web frontend as description
RESTful API: Intention revealing, API versioning, security, Async, Batch in RESTful manner
Retry: Exponential backoff for non-interactive trx, Linear for interactive trx
Cache population strategy
Partition data to work around performance and size limits.
Choose data store that best meets the needs
Offload background jobs to workers using async messaging for decoupling them
CDN: Serve static content from CDN to offload from compute
Auto-scaling: Schedule vs. Parameter
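The retry note above (exponential backoff for non-interactive transactions, linear for interactive ones) can be sketched as follows. This is a minimal illustration: `TransientError`, the base delay, and the retry count are assumptions, and the `sleep` parameter exists only so the delays can be observed without actually waiting.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


def backoff_delays(strategy: str, attempts: int, base: float = 0.5) -> List[float]:
    """Delays between attempts: base*2^i (exponential) or base*(i+1) (linear)."""
    if strategy == "exponential":
        return [base * (2 ** i) for i in range(attempts)]
    if strategy == "linear":
        return [base * (i + 1) for i in range(attempts)]
    raise ValueError(f"unknown strategy: {strategy}")


def call_with_retry(operation: Callable[[], T], strategy: str, retries: int = 3,
                    base: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run operation, retrying on TransientError up to `retries` times."""
    for delay in backoff_delays(strategy, retries, base):
        try:
            return operation()
        except TransientError:
            sleep(delay)
    return operation()  # final attempt: let any error propagate to the caller
```

Non-transient errors are deliberately not caught, so they fail fast instead of being retried.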
Let’s go clockwise
Sharding: UBER partitions data by city, which leads to hotspots.
Index table pattern: Using search for indexing
Valet key: Shared access signature
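The sharding hotspot mentioned above comes from the choice of partition key. A small sketch with synthetic data, assuming 70% of trips come from one city (an invented distribution, for illustration only): keying by city puts most of the load on one partition, while hashing a high-cardinality key such as the trip id spreads it out.

```python
import hashlib
from collections import Counter
from typing import Iterable, List, Tuple


def hash_partition(key: str, partitions: int) -> int:
    """Stable hash partitioning (md5 used for distribution, not security)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def make_trips(count: int) -> List[Tuple[str, str]]:
    """Synthetic (trip_id, city) pairs; 7 of every 10 trips are in city-a."""
    cities = ["city-a"] * 7 + ["city-b", "city-c", "city-d"]
    return [(f"trip-{i}", cities[i % 10]) for i in range(count)]


def load_by_city(trips: Iterable[Tuple[str, str]]) -> Counter:
    """Load per partition when the city itself is the partition key."""
    return Counter(city for _, city in trips)


def load_by_trip_hash(trips: Iterable[Tuple[str, str]], partitions: int) -> Counter:
    """Load per partition when the hashed trip id is the partition key."""
    return Counter(hash_partition(trip_id, partitions) for trip_id, _ in trips)
```

With 1,000 synthetic trips and 4 partitions, the city key puts 700 trips on a single partition, while the hashed trip id yields a much more even spread. The trade-off is that queries scoped to one city now fan out across partitions.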
There’s a reason devs have done this.
Vertical slice of business domain w/ each slice becoming individual service
MSA is all about reducing dependency among hundreds of services
Since there are hundreds of services, somebody needs to know which service is running where.
GW does that. GW can also take care of cross cutting concerns such as logging, auth etc.
Each service has only a single responsibility
Isolated from other services in terms of failure
All it exposes is contract, don’t leak internal details
Data should be private to its service
Model each service around business not technology nor just data.
Services should interact only via API, not direct access to data
Decentralize all things especially release process for independent deployment.
System of engagement requires continuous innovation for the better user experience
It requires frequent deployment
Microservices make more sense for complex business domains than for simple CRUD-based ones.
CQRS segregates R/W and manages them separately
It is not limited to but often used in MSA
Here’s the reason
Once delivery is completed, account service has to do followup tasks
But account shouldn’t directly access data belonging to Delivery
Other services like Delivery history and Drone also consume the same events
Using a transaction log or event sourcing is another option. There are OSS components that support this scenario.
Benefits:
No point to point Integration
Very well decoupling producers from consumers
Highly scalable and distributable
Challenges:
Reliability: when you lose one event, it is not easy to recover from there
Variations:
Simple event processing: downstream actions are performed as new events are generated without time lag. Azure functions for many triggers
Complex event processing: Process a series of events for pattern matching. Using ASA, Storm/Spark
Event stream processing: Use streaming service and multiple consumer per different subsystem. e.g. IoT workload
- Devices can be connected directly or indirectly via a gateway
Cloud gateway provides endpoints for device connectivity and facilitates bidirectional communication with the backend system
ML to detect patterns (F1 racing)
The app backend handles device control and sends commands to devices via the GW
Often IoT solution is integrated with BI or other LoB systems via serving/adaptor
Real-time processing
Batch processing
Process on-the-fly by streaming and store the interim results then do more analysis
Store incoming data in cold storage first, preprocess then do stream analysis
Lambda vs. Kappa
Real-time and batch are converging in Kappa architecture.
Spark structured streaming is a way to go.
Upload large datasets using multiple threads in parallel
Use protocol conversion if necessary to speed up data transfer
Scrub sensitive data before publishing it
Encrypt sensitive data using data encryption at rest
Partition the data (e.g. per day)
Secure the data access by XXX
Automate data ingestion and XXX by orchestration tools
Implement retention policy
Backup?
Provision a separate cluster for HBase/Storm from the one used for batch processing
Data skew problem
Cascading process
CloudEra in Batch
Scenario:
Predictive analysis in construction machines (Caterpillar, Sandvik)
Connected car (Toyota)
Near real-time ETL (MS sales)
Interactive query in eCommerce (Jet.com)
Ingestion:
Caterpillar: 250 eps
Sandvik: 150K eps (PB per customer)
Serving layer:
Spark to generate Parquet format table for Tableau
Spark connector for DocDB
4 major use cases
Batch like ETL before sending data to BI
Real-time processing (Analyze click stream and optimize content placement)
Interactive data exploration (Data scientist explores new data and finds patterns)
Predictive analysis (F1 racing)
Weblogs
F1 racing
Boeing 787
Car telematics: driving pattern to estimate risk of incident
There’s no mechanical way to make decisions.
Some aggregates deserve to be microservices, others don’t.
Some aggregates should count on the ones in different BC.
Responsibility of aggregates
Delta
Team size
Dependency
Latency
How does the delivery service know its status? Is it coming from the delivery mgmt service? (pull or push)
Do we want to merge requestHandler and GW?
GW does only token checking and delegates auth to the auth service in the account BC
Why does it have Package, Drone, Delivery as services but no service for account and 3rd party? Do we need them?
Why doesn’t the delivery service contain the drone and package aggregates?
Does drone need persistent storage or cache?
What is the best API style?
Depending on the responsibility and latency req of the drone service in this context, it can be just caching status
Does every event from a drone come via EventHub only to DroneMgmt, or to the Delivery service as well?
Account service subscribes to delivery events and does the following once a delivery is completed:
Collect ratings, send emails, schedule payment