In this session we will cover best practices for creating and managing an Azure Service Fabric cluster securely and for scaling it based on demand.
3. Alberto Diaz Martin
alberto.diaz@encamina.com - @adiazcan
Alberto Diaz has more than 15 years of experience in the IT industry, all of them working
with Microsoft technologies. He is currently Chief Technology Innovation Officer at ENCAMINA,
leading software development with Microsoft technology, and a member of the management
team.
In the community, he works as an organizer and speaker at the most relevant Microsoft
conferences in Spain, where he is one of the leading voices on SharePoint, Office 365, and
Azure. Author of several books and articles in professional magazines and blogs, in 2013 he
joined the management team of CompartiMOSS, a digital magazine about Microsoft
technologies.
He has been named a Microsoft MVP since 2011, a recognition he has renewed for the seventh
consecutive year. He describes himself as a geek, a smartphone lover, and a developer.
Founder of TenerifeDev (www.tenerifedev.com), a .NET user group in Tenerife, and
coordinator of SUGES (SharePoint User Group of Spain, www.suges.es).
10. [Diagram: Service Fabric cluster of five VMs (VM #1 to VM #5), each running Service Fabric and your code, etc. Labels: Web Request, Port: 80; Port: 19080]
18. [Diagram: Service Fabric Cluster inside a VNET, made up of VMSS#1, VMSS#2 and VMSS#3 (three VMs each), with LB#1 to LB#3 and NSG#1 to NSG#3. Other labels: Security (Key Vault, AAD); Azure Storage for diagnostics and for SF logs; Managed Disk for VHDs; Jump Server]
19. ClientConnectionEndpoint (TCP) | 19000
HttpGatewayEndpoint (HTTP/TCP) | 19080
SMB support for Image Store | 445, 134
ClusterConnectionEndpointPort (TCP) | 1025
LeaseDriverEndpointPort (TCP) | 1026
Ephemeral Port range | As needed, min 256 ports
App ports | As needed
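As a rough illustration of how the fixed ports listed on this slide might be verified once the network rules are in place, here is a minimal Python sketch that probes each port over TCP. The names SF_PORTS, is_port_open and check_endpoints are hypothetical helpers, not part of any Service Fabric tooling, and a successful TCP connection only shows the port is reachable, not that the service behind it is healthy.

```python
import socket

# Fixed ports copied from the slide; the dictionary name is illustrative.
SF_PORTS = {
    "ClientConnectionEndpoint": 19000,
    "HttpGatewayEndpoint": 19080,
    "ClusterConnectionEndpointPort": 1025,
    "LeaseDriverEndpointPort": 1026,
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_endpoints(host, ports):
    """Probe every named port on host and report which ones accept connections."""
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

Calling check_endpoints with a node address and SF_PORTS from a machine that is allowed through the NSG (for example the jump server) would return a name-to-boolean map that is easy to log or assert on in a deployment script.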
25. [Diagram: replicas spread across FD1 to FD5; a PDU burnout takes out one FD]
• Number of FDs determines the headroom needed in case of unplanned failures
• Examples include a PDU failing or TOR maintenance that can take out all machines in a rack
• In terms of capacity, you need to leave enough headroom to accommodate failure of at least one FD
• This will result in SF moving/creating new replicas on the available machines in other FDs
26. [Diagram: replicas spread across FD1 to FD5 and UD1 to UD10; an SF upgrade takes one UD down]
• Number of Upgrade Domains determines the headroom needed in case of planned failures/downtimes
• An example is when a Service Fabric upgrade is going on and a UD is down: you need room for additional replicas if need be
27. You should plan your capacity in such a way that your service can at least survive:
• A loss of one FD
• A UD being down because of an upgrade going on
• An additional random node/VM failing
[Diagram: nodes spread across FD1 to FD5 and UD1 to UD10]
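The three survival conditions above can be turned into a back-of-the-envelope capacity check. The following Python sketch is one possible way to do it, not an official sizing formula. It assumes nodes are spread as evenly as possible across FDs and UDs, that every node offers the same capacity, and, pessimistically, that the lost FD, the UD under upgrade and the extra failed node do not overlap. All function names are hypothetical.

```python
import math


def worst_case_nodes_lost(total_nodes, fd_count, ud_count):
    """Upper bound on nodes unavailable when one FD is lost, one UD is
    being upgraded and one further node fails (assumed non-overlapping)."""
    if fd_count < 1 or ud_count < 1:
        raise ValueError("fd_count and ud_count must be at least 1")
    nodes_per_fd = math.ceil(total_nodes / fd_count)  # largest FD
    nodes_per_ud = math.ceil(total_nodes / ud_count)  # largest UD
    return min(total_nodes, nodes_per_fd + nodes_per_ud + 1)


def surviving_capacity(total_nodes, fd_count, ud_count, capacity_per_node):
    """Capacity left on the nodes that remain up in the worst case."""
    lost = worst_case_nodes_lost(total_nodes, fd_count, ud_count)
    return (total_nodes - lost) * capacity_per_node


def min_nodes_for_load(required_load, capacity_per_node, fd_count, ud_count):
    """Smallest node count whose worst-case surviving capacity covers the load."""
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    if 1 / fd_count + 1 / ud_count >= 1:
        # Losing one FD plus one UD could remove every node: no answer exists.
        raise ValueError("too few FDs/UDs to survive these failures")
    nodes = 1
    while surviving_capacity(nodes, fd_count, ud_count, capacity_per_node) < required_load:
        nodes += 1
    return nodes
```

For example, with 15 nodes over 5 FDs and 5 UDs the bound is 3 + 3 + 1 = 7 nodes lost, so only 8 nodes' worth of capacity should be counted on. Real clusters usually lose fewer nodes because the failure sets overlap, so this estimate errs on the safe side.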
35. The Recovery Point Objective (RPO) determines the amount of data you can afford to lose in a disaster.
The Recovery Time Objective (RTO) is the maximum tolerable length of time that your service can be down after a disaster occurs.
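To make the two definitions concrete, the sketch below checks a backup-and-restore plan against a given RPO and RTO. It is a simplified model under stated assumptions: backups are periodic and complete instantly, so the worst-case data loss equals one backup interval, and recovery consists of standing up a new cluster and then restoring data. The RecoveryPlan class and meets_objectives function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryPlan:
    backup_interval: timedelta   # time between successive backups
    new_cluster_time: timedelta  # time to stand up a replacement cluster
    restore_time: timedelta      # time to restore data from the latest backup

    def worst_case_data_loss(self):
        # A disaster just before the next backup loses one full interval.
        return self.backup_interval

    def worst_case_downtime(self):
        # Assumes the restore starts only after the new cluster is ready.
        return self.new_cluster_time + self.restore_time


def meets_objectives(plan, rpo, rto):
    """True if the plan's worst cases stay within both RPO and RTO."""
    return plan.worst_case_data_loss() <= rpo and plan.worst_case_downtime() <= rto
```

A plan with 15-minute backups, 30 minutes to build a cluster and 20 minutes to restore satisfies an RPO of 30 minutes and an RTO of one hour, but not an RTO of 45 minutes. An RPO or RTO of zero can never be met by this kind of plan, since any backup interval or rebuild time is greater than zero.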
36. Types of Disasters | RPO and RTO = 0, Write latency acceptable | RPO and RTO > 0
Data Center Outages | Cross-regional SF cluster | Stand up a new cluster, restore from backup
Cluster down | (Very low probability for cross-regional clusters) Stand up a new cluster, restore from backup | Stand up a new cluster, restore from backup
Machine / Node down | Deploy across 5+ FDs, 5+ UDs, Design for write quorum losses | Deploy across 5+ FDs, 5+ UDs, Design for write quorum losses
Other sources of data loss or “oops” | Restore from backup | Restore from backup
38. Cluster and Node state | Is the cluster healthy? Are all the nodes up? | Detect and diagnose hardware and infrastructure issues
Application and Service state | Upgrade status, number of services and replicas | Detect software and app issues, reduce service downtime
Resource Usage | Do all the nodes need to be up? What is the average CPU usage? | Understand resource consumption and drive better business decisions
Performance Tracking | Is there any unexpected latency? Are the services responsive? | Optimize application, service, and infrastructure performance
Custom Application Metrics | Is your app being used in the way that you expected? Is solution effective? | Generate business insights and improvements
Although certificates are supported on standalone clusters, we recommend that you use AD.
For any production deployment, always use automated deployment. Use the tool of your choice, or PowerShell scripts.
In Azure, use certificates for client access only as a “break glass” scenario.
ARM template used: https://github.com/Azure/azure-quickstart-templates/tree/master/service-fabric-secure-nsg-cluster-65-node-3-nodetype
In Azure you do not get to choose the number of FDs. The VMSS instances are spread across 5 FDs.
In Azure you do not get to choose the number of UDs. The VMSS instances are spread across 5 UDs.
The link above points to: https://docs.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-resource-manager-cluster-description#cluster-capacity
Now let us shift our focus to the best practices for setting up clusters in Azure…
This matrix represents suggested mitigations. The actual mitigation that you adopt depends on your application and business continuity plans.
When it comes to monitoring, think beyond your cluster, nodes and application. Think about how you can also use it to track resource usage, application performance and the effectiveness of your application. You will need to add custom application metrics to determine whether your service is truly doing what it is supposed to do…
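As one hedged illustration of what custom application metrics could look like inside a service, here is a small in-process recorder written in Python. It does not use any Service Fabric or Azure monitoring API. MetricsRecorder and its methods are hypothetical, and a real service would forward these values to whatever monitoring pipeline it already uses.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean


class MetricsRecorder:
    """Keeps simple counters and latency samples in memory."""

    def __init__(self):
        self._counters = defaultdict(int)
        self._latencies = defaultdict(list)

    def increment(self, name, amount=1):
        """Count a business event, e.g. an order placed."""
        self._counters[name] += amount

    def record_latency(self, name, seconds):
        """Store one latency sample for the named operation."""
        self._latencies[name].append(seconds)

    @contextmanager
    def timed(self, name):
        """Measure the wall-clock duration of the wrapped block."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record_latency(name, time.perf_counter() - start)

    def count(self, name):
        return self._counters.get(name, 0)

    def average_latency(self, name):
        samples = self._latencies.get(name)
        return mean(samples) if samples else None

    def slow_operations(self, threshold_seconds):
        """Names of operations whose average latency exceeds the threshold."""
        return sorted(
            name
            for name, samples in self._latencies.items()
            if samples and mean(samples) > threshold_seconds
        )
```

Counters answer the "is the app being used the way you expected" question, while the latency samples feed the "is there any unexpected latency" question. Wrapping a request handler in `with recorder.timed("checkout"):` is enough to start collecting samples for that operation.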
Make your E2E operational scenarios easier by using the Azure ServiceFabric RM module.
Adopt the best practices for planning, deploying and securing your clusters.
Write down a business continuity plan: disasters happen, and it is best to be prepared for them.
Leverage all the out-of-the-box monitoring and diagnostics capabilities.