Systems that meet stringent service availability (SA) and high availability (HA) requirements have been around for decades, but diverse segments use varied terminology to describe the same concepts. This session will provide a high-level technical overview of the Service Availability Forum standards and the support of those standards within OpenSAF, allowing those familiar with HA concepts to map their terminology to SA Forum and OpenSAF terminology.
The session will also help those relatively new to OpenSAF or the HA domain to familiarize themselves with the terms and concepts. This session will lay the technical foundation for the remainder of the symposium so that attendees get the most out of the more detailed presentations that follow.
OpenSAF involves a number of complex ideas and is designed to work in many different environments. To make it easy for new users to get started, we will also describe the resources available for learning about OpenSAF, the environments in which the code base can be used, and the ways to interact with the community.
2. Introduction to OpenSAF
• Service availability and high availability systems and
concepts have been around for decades
• However, HA terminology tends to vary from industry to
industry and company to company
• Goals of this session:
– High-level technical overview of the Service Availability™ Forum
standards
– Overview of the support of those standards within OpenSAF
– Allow you to:
• Familiarize yourself with general HA concepts and terminology
OR
• Map the HA concepts and terminology with which you are
familiar to the SA Forum and OpenSAF versions
– Resources for getting started with OpenSAF
3. SA Forum Interfaces: AIS & HPI
[Figure: SA Forum interfaces shown as a layered stack]
• Applications
• System Management
• Application Interface Specifications (AIS)
• Service Availability Middleware (SAF standards implemented by OpenSAF)
– Software Mgmt Framework (SMF)
– Availability Management Framework (AMF)
– Information Model Mgmt (IMM)
– Cluster Membership (CLM)
– Platform Mgmt (PLM)
– Notification (NTF)
– Log (LOG)
– Checkpoint (CKPT)
– Event (EVT)
– Message (MSG)
– Lock (LCK)
• Operating System
• Virtualization
• Hardware Platform Interface (HPI)
• Hardware Platforms A, B, C and D
4. But how to make sense of the
SA Forum “acronym soup”?
5. AIS Service Groupings
• First, understand that the AIS services fall into three
logical groupings*:
– System Management Services
• Members: Information Model Mgmt (IMM), Software Mgmt Framework (SMF), Notification (NTF), Log (LOG)
• Services that manage central system capabilities commonly used by both AIS services and applications
– Resource Availability Management Services
• Members: Availability Management Framework (AMF), Cluster Membership (CLM), Platform Mgmt (PLM)
• Services that manage and monitor the state of key system resources that affect availability: hardware / operating system, cluster nodes, applications
– Application Services
• Members: Checkpoint (CKPT), Event (EVT), Message (MSG), Lock (LCK)
• Optional services to support application operations such as inter-process communication, state replication and shared resource access control
* - Not official SA Forum AIS service groupings
6. Fault Management Cycle
• Second, AIS services that manage availability are designed around a standard fault management cycle
– Detection
• E.g. component healthchecks
– Isolation
• E.g. blade power off
– Recovery
• E.g. failover of workload assignments to associated standby resources
– Repair
• E.g. automatic restart of failed resource
– Notification
• E.g. state change notifications sent by service managing the resource
[Figure: cycle diagram of Detection, Isolation, Recovery, Repair and Notification]
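The cycle on this slide can be sketched as an ordered sequence of phases. This is an illustrative sketch only: `Phase`, `EXAMPLES` and `run_cycle` are hypothetical names rather than part of any AIS API, and the ordering simply follows the bullet order above.

```python
from enum import Enum


class Phase(Enum):
    # Enum members iterate in definition order, which mirrors the slide.
    DETECTION = "detection"
    ISOLATION = "isolation"
    RECOVERY = "recovery"
    REPAIR = "repair"
    NOTIFICATION = "notification"


# One example action per phase, taken from the bullets on the slide.
EXAMPLES = {
    Phase.DETECTION: "component healthcheck failed",
    Phase.ISOLATION: "blade powered off",
    Phase.RECOVERY: "workload failed over to standby",
    Phase.REPAIR: "failed resource restarted",
    Phase.NOTIFICATION: "state change notification sent",
}


def run_cycle(resource):
    """Return one log line per phase for a single fault on `resource`."""
    return [f"{phase.value}: {resource}: {EXAMPLES[phase]}" for phase in Phase]


for line in run_cycle("blade-3"):
    print(line)
```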
7. Resource Dependencies
• Third, Availability Management in the AIS world is driven by a detailed understanding of the availability management dependencies across all resource types
– Managed Applications
• Simple to complex dependencies and relationships can be modeled between the various software elements
• Dependency on a particular node also modeled
– AMF Node
• Represents a node where AMF services are provided
• Depends on a CLM node
– CLM Node
• Represents a cluster node where AIS services are provided
• Depends on an Execution Environment (optional)
– Platform Resource
• Containment and logical dependencies represented between platform resources
• Execution Environment (EE)
– Represents an operating system instance (standalone or virtual)
• Hardware Element (HE)
– Represents a physical hardware resource in the system
[Figure: dependency stack of Managed Applications, AMF Node, CLM Node and Platform Resource (Hardware Element, Execution Environment)]
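The dependency chain on this slide can be sketched as a linked list of resources in which a resource is available only if everything beneath it is. The `Resource` class and all instance names are hypothetical, and the assumption that the execution environment depends on a single hardware element is made for illustration; the slide only says that containment and logical dependencies exist between platform resources.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    name: str
    healthy: bool = True
    depends_on: Optional["Resource"] = None

    def available(self) -> bool:
        """A resource is available if it and all its dependencies are healthy."""
        if not self.healthy:
            return False
        return self.depends_on is None or self.depends_on.available()


# Bottom-up chain: HE <- EE <- CLM node <- AMF node <- managed application.
he = Resource("blade-1")
ee = Resource("linux-1", depends_on=he)
clm_node = Resource("clm-node-1", depends_on=ee)
amf_node = Resource("amf-node-1", depends_on=clm_node)
app = Resource("managed-app", depends_on=amf_node)
```

Marking `he.healthy = False` makes `app.available()` return `False`, which is the kind of cross-layer reasoning the dependency model enables.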
8. Common Design Patterns
• Fourth, the AIS services follow common design
patterns:
– API
• Common library lifecycle
• Naming conventions
– Resource managed by service is represented as a managed object
• Typically with associated state model
• Managed objects stored in common information model
– Administrative operations
• X.731 style administrative operations for resources which
affect availability
– Notifications automatically generated by AIS services for
significant system events (alarms, state changes, etc.)
9. Resource Availability Management Services
• Availability Management Framework (AMF)
– Manages the lifecycle and monitors the state of the managed
applications within the system
– More detail in upcoming slides
• Cluster Membership (CLM)
– Provides cluster membership change notifications to AIS services and interested applications
– OpenSAF CLM implements cluster management protocol dealing with:
• Cluster formation
• Active controller selection & failover
• Node failure detection
• Platform Management (PLM)
– Manages the state of modeled hardware elements and execution environments (operating system instances)
– Hardware element states and events accessed through Hardware Platform Interface (HPI)
– Manages graceful blade extraction / de-activation cases
– Supports hardware element controls (power on/off and reset)
– Optional service within OpenSAF
[Figure: stack of AMF, CLM and PLM]
10. Availability Management Framework (AMF)
AMF Logical Entities
• Structural Entities
– AMF Application
• Represents the highest-level service(s) provided by the system
– Service Group (SG)
• Represents a group of like logical resources that provide the same service(s)
• Associated redundancy model (e.g. 1+1)
– Service Unit (SU)
• Aggregates a set of resources which when combined provide a higher-level service
– Component
• Represents one or more resources that perform a function within the system
[Figure: containment hierarchy: AMF Application 1..* Service Group 1..* Service Unit 1..* Component]
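The containment hierarchy on this slide, where each level holds one or more of the next, can be sketched with plain data classes. The class and field names below are hypothetical and chosen for readability; they are not the AMF information model class names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str


@dataclass
class ServiceUnit:
    name: str
    components: List[Component] = field(default_factory=list)


@dataclass
class ServiceGroup:
    name: str
    redundancy_model: str
    service_units: List[ServiceUnit] = field(default_factory=list)


@dataclass
class AmfApplication:
    name: str
    service_groups: List[ServiceGroup] = field(default_factory=list)


def validate(app: AmfApplication) -> List[str]:
    """Check the 1..* multiplicities: every container needs at least one child."""
    problems = []
    if not app.service_groups:
        problems.append(f"{app.name}: no service groups")
    for sg in app.service_groups:
        if not sg.service_units:
            problems.append(f"{sg.name}: no service units")
        for su in sg.service_units:
            if not su.components:
                problems.append(f"{su.name}: no components")
    return problems
```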
11. Availability Management Framework (AMF)
AMF Logical Entities
• Workload Entities
– Service Instance (SI)
• Represents a workload to be supported by the system
• Has associated redundancy requirements (1+1, N+M, etc.)
• Protected by an identified SG
• Assigned to one or more SUs with an HA state of active, standby, quiescing or quiesced
– Component Service Instance (CSI)
• Represents a more granular workload that needs to be supported by the system
• Assigned to one or more components
[Figure: AMF Application 1..* Service Group 1..* Service Unit 1..* Component; a Service Instance is protected by a Service Group and assigned to Service Units; Service Instance 1..* Component Service Instance, which is assigned to Components]
12. Availability Management Framework (AMF)
AMF Logical Entities
• Common Characteristics
– Well-defined state model for each logical entity type
• Operational
• Administrative
• Etc.
– X.731 style administrative operations
• Lock
• Unlock
• Shutdown
• Etc.
• Common AMF Component Types
– SA-aware
– Non-proxied, non-SA-aware
– Proxied, non-SA-aware
[Figure: SA-aware Component Example: AMF performs lifecycle mgmt of the comp process through CLC-CLI scripts and HA state assignment through the AMF library]
13. Availability Management Framework (AMF)
Service Group Redundancy Models
• Key redundancy model characteristics
– Preferred SI assignment model
• # of active resource(s)
• # of standby resource(s)
– Allowed concurrent HA state assignments for SUs
– # of assignable SUs
• Redundancy model options
– 2N
• Most common redundancy model
• 1 active resource and 1 standby resource per SI
• SUs can have either all active or all standby SI assignments
– N+M
– No Redundancy
– N-way
– N-way active
[Figure: 2N Service Group Example: SI1 and SI2 are each assigned active (A) to SU1 on Node1 and standby (S) to SU2 on Node2]
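The 2N rule above (one active and one standby per SI, with each SU holding either all active or all standby assignments) can be sketched as follows. `assign_2n` and `fail_over` are hypothetical helpers written for this illustration, not AMF functions, and the choice of the first two SUs as active and standby is an assumption.

```python
def assign_2n(service_instances, service_units):
    """Give every SI its active assignment on one SU and standby on another."""
    if len(service_units) < 2:
        raise ValueError("2N needs at least two service units")
    active_su, standby_su = service_units[0], service_units[1]
    return {
        si: {"active": active_su, "standby": standby_su}
        for si in service_instances
    }


def fail_over(assignments, failed_su):
    """Return new assignments after `failed_su` is lost."""
    result = {}
    for si, roles in assignments.items():
        if roles.get("active") == failed_su:
            # Promote the standby; no standby remains until the SU is repaired.
            result[si] = {"active": roles.get("standby"), "standby": None}
        elif roles.get("standby") == failed_su:
            result[si] = {"active": roles["active"], "standby": None}
        else:
            result[si] = dict(roles)
    return result


before = assign_2n(["SI1", "SI2"], ["SU1", "SU2"])
after = fail_over(before, "SU1")
```

After the failover, both SIs are active on `SU2` and have no standby, which matches the all-active-or-all-standby property of the model.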
14. Availability Management Framework (AMF)
Error Recovery Policies
• Pre-defined AMF component error recovery policies
– Configurable
– Can be overridden at runtime
• Up to 3 actions per policy
– Isolation
– Recovery
– Repair
• Recovery policy scopes
– Component
– Service Unit
– Node
• Recovery policy types
– Restart
– Failover
– Failfast
• Recovery escalation policies
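Escalation can be pictured as a ladder that widens the recovery scope when a narrower recovery keeps failing. The sketch below is an assumption-laden illustration: the specific rungs in `LADDER`, the `max_per_level` threshold and the function name `next_action` are all hypothetical, since the slide lists only the scopes and types and not the escalation rules themselves.

```python
# Hypothetical escalation ladder built from the scopes and types on the slide.
LADDER = [
    ("component", "restart"),
    ("service unit", "restart"),
    ("service unit", "failover"),
    ("node", "failover"),
    ("node", "failfast"),
]


def next_action(prior_failures, max_per_level=2):
    """Pick a (scope, type) pair; escalate after `max_per_level` failures per rung."""
    level = min(prior_failures // max_per_level, len(LADDER) - 1)
    return LADDER[level]


for failures in range(0, 10, 2):
    print(failures, next_action(failures))
```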
15. System Management Services
Information Model Management (IMM)
• Information Model Highlights
– Based on pre-defined object classes
(including AIS classes)
– Holds both configuration and runtime
objects
– Used by AIS services to store current
configuration and runtime state info
– Can be used by applications as well
• Object Management API
– Object class management
– Access object attribute values
– Search information model
– Configuration change requests
– Administrative operation invocation
• Object Implementer API
– Runtime object management
– CCB validation and application
– Administrative operation handling
• OpenSAF Implementation
– Persistence of information model
managed through Persistence BackEnd
(PBE) feature
– Replicated to multiple cluster nodes
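The split between configuration change requests (Object Management API) and CCB validation and application (Object Implementer API) can be sketched as an all-or-nothing bundle of changes. The `Ccb` class, its methods and the validator callback signature are hypothetical and greatly simplified; they do not mirror the real IMM API.

```python
class Ccb:
    """A bundle of pending attribute changes applied only if all validate."""

    def __init__(self, model, validator):
        self.model = model          # dict: object name -> {attribute: value}
        self.validator = validator  # stands in for the object implementer
        self.pending = {}

    def modify(self, obj, attr, value):
        self.pending[(obj, attr)] = value

    def apply(self):
        changes, self.pending = self.pending, {}
        # Validation phase: the implementer may reject any single change.
        if not all(self.validator(o, a, v) for (o, a), v in changes.items()):
            return False  # nothing is written to the model
        # Application phase: every change in the bundle is written.
        for (obj, attr), value in changes.items():
            self.model.setdefault(obj, {})[attr] = value
        return True


model = {}
ccb = Ccb(model, validator=lambda obj, attr, value: value >= 0)
ccb.modify("obj1", "rank", 1)
ccb.modify("obj2", "rank", -1)
rejected = ccb.apply()  # False: one bad change rejects the whole bundle
```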
16. System Management Services
Software Management Framework (SMF)
• SMF controls migration from one deployment configuration to another
• Upgrade methods
– Rolling upgrade
– Single step upgrade
• [De-]Activation Unit Scope
– AMF Node
– Service Unit
• During the migration SMF
– Maintains the campaign state change model
– Takes measures to enable error recovery
– Monitors for potential errors caused by the migration
– Deploys error recovery procedures
[Figure: the Software Management Framework takes an Upgrade Campaign Definition ("Upgrade Instructions") and adaptation commands (SMF config object), installs / removes software bundles from the Software Repository on target nodes, and performs admin operations and Read/Create/Delete/Update of objects in the Information Model]
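A rolling upgrade handles one activation unit at a time, whereas a single step upgrade handles them all together. The sketch below illustrates the rolling case with a simple rollback on error. The function name, the callbacks and the returned state strings are hypothetical, and rolling back already upgraded units in reverse order is an assumed recovery procedure, not one taken from the SMF specification.

```python
def rolling_upgrade(units, upgrade_unit, rollback_unit):
    """Upgrade units one by one; undo completed units if any step fails.

    `upgrade_unit(unit)` returns True on success. The failing unit itself is
    assumed to be left unchanged by the failed step.
    """
    upgraded = []
    for unit in units:
        if upgrade_unit(unit):
            upgraded.append(unit)
            continue
        for done in reversed(upgraded):
            rollback_unit(done)
        return "rolled_back", []
    return "completed", upgraded


rolled = []
state, done = rolling_upgrade(
    ["node1", "node2", "node3"],
    upgrade_unit=lambda unit: unit != "node3",  # simulate a failure on node3
    rollback_unit=rolled.append,
)
```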
17. System Management Services
• Notification (NTF)
– Publish-and-subscribe semantics for system-level notifications
– Syntax and semantics for ITU X.73x notifications:
• Alarm / security alarm / state change / object create / delete / attribute change
– Alarm and security alarm notifications automatically logged
through LOG service
• Log (LOG)
– Flexible, centralized, system-wide logging mechanism
– Pre-defined log streams: alarm, notification, system
– Multiple, custom application log streams allowed
– Configurable log stream characteristics including:
• Log file full action: halt, wrap, or rotate
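The three log file full actions differ in what happens to a record that arrives when the current file is full. The sketch below models that behaviour in memory. `LogStream` and its parameters are hypothetical, and limits are counted in records for simplicity, which is an assumption made for this illustration.

```python
class LogStream:
    def __init__(self, action, max_records=3, max_files=2):
        self.action = action            # "halt", "wrap" or "rotate"
        self.max_records = max_records  # capacity of one file (in records)
        self.max_files = max_files      # only used by "rotate"
        self.files = [[]]               # oldest file first

    def write(self, record):
        """Store `record`; return False if it had to be dropped."""
        current = self.files[-1]
        if len(current) < self.max_records:
            current.append(record)
            return True
        if self.action == "halt":
            return False                # file is full: stop logging
        if self.action == "wrap":
            current.pop(0)              # overwrite the oldest record
            current.append(record)
            return True
        if self.action == "rotate":
            self.files.append([record])  # start a new file
            if len(self.files) > self.max_files:
                self.files.pop(0)        # discard the oldest file
            return True
        raise ValueError(f"unknown action: {self.action}")


stream = LogStream("rotate")
for n in range(7):
    stream.write(n)
```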
18. Application Services
• Checkpoint (CKPT)
– Intended as a state replication mechanism for distributed
applications
– Can be used for all standby “temperature levels”
• Cold
• Warm
• Hot (through OpenSAF CKPT service API extension)
– Semantics of a checkpoint
• Arbitrary set of sections containing opaque data
• Stored in one or more replicas distributed across cluster
• Reads and writes occur against the active replica
– Both synchronous and asynchronous replication options
available
– Collocated checkpoint option provided for highest performance
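The checkpoint semantics on this slide (sections of opaque data, several replicas, reads and writes against the active replica, synchronous or asynchronous replication) can be sketched in memory as follows. The `Checkpoint` class and its method names are hypothetical and do not correspond to the CKPT API; treating the first node as the active replica is also an assumption.

```python
class Checkpoint:
    def __init__(self, replica_nodes, synchronous=True):
        # Each replica maps section id -> opaque bytes.
        self.replicas = {node: {} for node in replica_nodes}
        self.active = replica_nodes[0]
        self.synchronous = synchronous
        self.pending = []  # updates not yet copied to the other replicas

    def write(self, section_id, data: bytes):
        self.replicas[self.active][section_id] = data
        if self.synchronous:
            self._propagate(section_id, data)   # copy before returning
        else:
            self.pending.append((section_id, data))

    def flush(self):
        """Copy queued asynchronous updates to the non-active replicas."""
        for section_id, data in self.pending:
            self._propagate(section_id, data)
        self.pending.clear()

    def read(self, section_id) -> bytes:
        return self.replicas[self.active][section_id]

    def _propagate(self, section_id, data):
        for node, sections in self.replicas.items():
            if node != self.active:
                sections[section_id] = data


ckpt = Checkpoint(["node1", "node2"], synchronous=False)
ckpt.write("session-table", b"opaque state")
```

With asynchronous replication, the write returns before `node2` holds the data; a standby reading its local replica sees the update only after `flush()`.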
19. Application Services
• Event (EVT)
– Publish-and-subscribe communication paradigm
– Flexible event channel, pattern, and filtering definition
– Subscriber event queue maintained within app process
• Message (MSG)
– Messages sent to and read from message queues
– Single message queue owner at a time
– Message queue maintained outside app process
– Message queues can be logically grouped
• Messages can be sent to a message queue group
• Associated distribution policy (round-robin, broadcast, etc.)
• Lock (LCK)
– Cluster-wide, distributed lock service
– Can be used to control access to cluster-level shared resources
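The message queue group distribution policies mentioned above can be sketched with two of the named policies, round-robin and broadcast. `QueueGroup` and its methods are hypothetical names for this illustration and are not taken from the MSG API.

```python
from collections import deque
from itertools import cycle


class QueueGroup:
    def __init__(self, queue_names, policy="round-robin"):
        self.queues = {name: deque() for name in queue_names}
        self.policy = policy
        self._next = cycle(list(queue_names))  # round-robin cursor

    def send(self, message):
        """Deliver `message` per the policy; return the target queue names."""
        if self.policy == "broadcast":
            targets = list(self.queues)        # every queue in the group
        elif self.policy == "round-robin":
            targets = [next(self._next)]       # one queue, taken in turn
        else:
            raise ValueError(f"unknown policy: {self.policy}")
        for name in targets:
            self.queues[name].append(message)
        return targets


group = QueueGroup(["q1", "q2"])
first = group.send("a")
second = group.send("b")
third = group.send("c")
```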
20. Getting Started with OpenSAF
• OpenSAF Technical Educational Resources
– Developer Wiki [http://devel.opensaf.org/wiki]
– OpenSAF Developers blog [http://devel.opensaf.org/blog]
– OpenSAF mailing lists [Subscribe: http://list.opensaf.org/maillist/listinfo/]
• Users [Archive: http://list.opensaf.org/pipermail/users/]
• Development [Archive: http://list.opensaf.org/pipermail/devel/]
• Announce [Archive: http://list.opensaf.org/pipermail/announce/]
– Latest documentation [http://devel.opensaf.org/hg/opensaf-4.x-documentation/archive/tip.tar.gz]
– FAQ [http://www.opensaf.org/HOA/assn14944/images/FREQUENTLY%20ASKED%20QUESTIONS%20ABOUT%20OPENSAF%20RELEASE%204%20Final%20for%20publication.docx]
– README files in source code repository
• SA Forum Application Interface Specifications [http://www.saforum.org/Service-Availability-Forum:-Application-Interface-Specification-~217404~16627.htm]