The Magazine for Developers of Open Communication, Industrial, and Rugged Systems

High availability

Design must-haves for your mission-critical system

By GoAhead Software

Service availability, or uninterrupted service to the end user in the presence of failures, cannot be effectively bolted on to an already-developed system if little or no consideration has been given to High Availability (HA) during the design phases. In this article, Dr. Naseem outlines the key considerations a designer of a highly available system must keep in mind.

From a designer’s perspective, a Base Station Controller (BSC) provides three major functions: Operations, Administration, Maintenance, and Provisioning (OAM&P); Call Control (CC); and Media Control (MC). These functions reside on the management plane, the control plane, and the data plane of the system, respectively (Figure 1).

Figure 1: A high-level view of network elements

OAM&P provides overall system management capabilities that allow an operator or a management application to monitor, query, modify, and manage a host of information including configuration data, events, and logs. In addition, it performs administrative operations to take various system resources in and out of service. CC manages the control plane of the system: It provides the required signaling capabilities, maintains detailed records of the voice calls through the system, and communicates with the other functional modules such as OAM&P and MC. MC oversees the data plane by managing the switching configuration for the voice media and communicates with OAM&P and CC.

The designer must also take into account the nonfunctional requirements to ensure the system will perform under varying operating conditions. These requirements include fault recovery times, failover times, query response times required of various services, system performance under increasing call workload, and scalability of the system as the overall call volume increases. Furthermore, the system must be designed to support extensibility, testability, and upgradability.

Once functional and nonfunctional requirements have been established, clearly documented, and understood, the system designer must consider the target hardware configuration that can support these requirements. To ensure carrier-grade availability (99.999 percent or higher uptime), the system configuration must provide redundant hardware components to eliminate any potential single points of failure. The chosen hardware configuration must also support the required software functionality and its distribution throughout the system.
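To make the 99.999 percent target concrete, the short sketch below converts an availability figure into a yearly downtime budget and shows how redundancy raises availability. It is an illustration only: the function names are hypothetical, and the parallel formula assumes that units fail independently and that failover is instantaneous and perfect, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def annual_downtime_minutes(availability: float) -> float:
    """Return the downtime per year, in minutes, allowed by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(unit_availability: float, replicas: int) -> float:
    """Availability of a group of independent units when any single unit suffices.

    Assumption: failures are independent and failover itself never fails.
    """
    return 1.0 - (1.0 - unit_availability) ** replicas


if __name__ == "__main__":
    budget = annual_downtime_minutes(0.99999)
    print(f"Five-nines downtime budget: {budget:.2f} minutes per year")
    pair = parallel_availability(0.999, 2)
    print(f"Two 99.9 percent units in parallel: {pair:.6f}")
```

Under these assumptions, five nines leaves a budget of a little over five minutes of downtime per year, which is why single points of failure have to be designed out rather than repaired after the fact.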

With a bladed chassis for the BSC design, designers can configure a variety of blades to achieve different redundancy requirements. For example, the management plane software can be resident on redundant system management blades to achieve OAM&P functionality. A pair of general-purpose compute blades can be employed in a redundant configuration to host the control plane software and the CC functionality. Finally, hosting the data plane on redundant pairs of line cards makes MC functionality possible (Figure 2). The number of redundant hardware resources required depends upon other factors such as the desired redundancy policies (for example, 1+1, 2N, or N+M).

Figure 2: A high-level view of deployment configuration

Theory and modeling

Once the requirements have been clearly established, the system designer is faced with the task of describing the software functionality needed to meet such requirements. Traditionally, software requirements are written and handed over to engineers to implement the requisite software functionality. A more cost-effective method has emerged recently: Special Interest Groups (SIGs) such as the Service Availability Forum (SA Forum) have defined and standardized the interfaces for key services required to implement HA middleware, pairing these interface specifications with a detailed API programming model. The SA Forum has made significant strides in creating HA middleware as a commercial software category, with multiple vendors offering middleware implementations based on the SA Forum specifications.

At the hardware level, the Hardware Platform Interface (HPI), a set of libraries provided by the hardware supplier, serves up detailed hardware information, such as the type and number of cards, sensors, controls, fans, and power supplies, to the Service Availability (SA) middleware. The middleware can consume this information without having to know the specific details of the underlying hardware or its architecture.

As long as the hardware supports HPI, this hardware abstraction layer enables SA middleware that communicates through HPI to move easily across different hardware implementations. The Application Interface Specification (AIS) presents a similar abstraction layer to the application developer. Once an AIS-compliant application is developed, it can be moved across multiple underlying SA middleware implementations as long as the middleware includes AIS functionality. AIS provides a host of services relevant to the BSC design example at hand, a powerful HA framework, and a standardized information model management capability. Many of these services will be referenced as we discuss the various functional blocks. The SA Forum reference architecture also allows for incorporation of other important functionality that exists in the commercial market but has not yet been addressed by the Forum’s specification work.

Key SA Forum system modeling concepts will be used in this example. The Availability Management Framework (AMF) defines a set of logical entities that allows the system designer to model a sophisticated system for high availability. A Component (C) is a hardware or software resource that needs to be managed. A Service Unit (SU) is a collection of components that cooperate to offer a desired service. Multiple redundant SUs that perform the same service are grouped together to form a Service Group (SG). Finally, the workload assigned to an SU is a Service Instance (SI), and the SU takes an HA state of active or standby for it.
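The relationships among these entities can be sketched as a small data model. The classes below are purely illustrative and are not the AIS programming interface: every class, field, and method name is an assumption chosen to mirror the terms in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    name: str  # a managed hardware or software resource


@dataclass
class ServiceUnit:
    name: str
    components: List[Component]
    # HA state held by this SU for each assigned service instance.
    assignments: Dict[str, str] = field(default_factory=dict)


@dataclass
class ServiceGroup:
    name: str
    redundancy_model: str  # for example "2N" or "N+M"
    service_units: List[ServiceUnit] = field(default_factory=list)

    def assign(self, si_name: str, active: ServiceUnit, standby: ServiceUnit) -> None:
        """Assign a service instance to one active SU and one standby SU."""
        if active is standby:
            raise ValueError("active and standby must be different service units")
        for su in (active, standby):
            if not any(su is member for member in self.service_units):
                raise ValueError(f"{su.name} is not part of service group {self.name}")
        active.assignments[si_name] = "active"
        standby.assignments[si_name] = "standby"


if __name__ == "__main__":
    su_a = ServiceUnit("oamp-su-a", [Component("mgmt-agent")])
    su_b = ServiceUnit("oamp-su-b", [Component("mgmt-agent")])
    group = ServiceGroup("oamp-sg", "2N", [su_a, su_b])
    group.assign("oamp-si", active=su_a, standby=su_b)
    print(su_a.assignments, su_b.assignments)
```

Keeping the redundancy model as data on the service group, rather than in component code, reflects the idea that redundancy is a matter of configuration.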

Practice

As the market adopts the work of SIGs such as SA Forum, an ecosystem of COTS suppliers is emerging with commercial and open source implementations of HA middleware, application services, and applications. One such implementation, SAFfire high availability and management middleware, from GoAhead Software, Inc., will be referenced to illustrate that in practice it is possible to construct a carrier-grade BSC using COTS components.

The BSC

Let us now apply the taxonomy described earlier to the design and development of the base station controller.

OAM&P

Requirements: This functional block addresses the management plane functionality of the BSC and must meet a set of key requirements:

• Provide system and network management interfaces to monitor, query, and modify management information
• Host management applications that enable fault prediction/correlation and implement custom hot swap management policies
• Send configuration information to other relevant functional blocks in the system
• Manage middleware and application configurations, including modifying existing and provisioning new configurations
• Monitor – and generate as warranted – various system alarms and notifications
• Make system logs accessible for inspection and analysis of possible system failures


In order to provide high availability for this system level function, an active/standby configuration utilizing a 1+1 (or 2N) redundancy model is appropriate.

OAM&P theory and modeling

OAM&P requires a set of system-level middleware services that bind the system into one coherent entity. These services must enable one to describe the system (that is, create an information model) and supply a mechanism for reporting system-wide information. Various services mentioned in the SA Forum reference model can be employed here to achieve these objectives. The Information Model Management Service (IMM) is employed as a mechanism to describe the system; logical entities within a variety of services are then used to construct the information model. The Cluster Management Service (CLM) and its logical entities are employed for cluster membership description, and the Platform Management Service (PLM) and its logical entities handle hardware platform description. AMF is used to define the redundancy policies for various services as well as the application. The Log Service (LOG) can be used as a central repository of all system activity, and the Notification Service (NTF) as a mechanism for notification of events throughout the system.

OAM&P practice

The system model adopted here neatly divides labor between the designers of the systems and the implementers of the system capabilities. Using the SA Forum model, the developers can build discrete system capabilities (that is, components) that are built once and deployed in a variety of different system configurations. This is possible because the design choices are not a matter of implementation, but a matter of AMF system model configuration. Once the components have been developed, the designer then has the flexibility to create SUs and SGs to implement various functional blocks as well as implement desired redundancy policies – 2N in this case.

All capabilities mentioned here for the design and implementation of the OAM&P functional block are available in commercial products and have been used in such a system design.

Call Control

Requirements: This functional block addresses the control plane functionality of the BSC and must meet a set of key requirements:

• Manage the signaling, setup, and tear down of calls flowing through the BSC
• Maintain call records of all calls through the system for billing purposes
• Communicate with the OAM&P functional block to export relevant billing information
• Communicate with the Media Control functional block to manage the media switching configuration

 

In order to provide high availability for this function, an N+M redundancy model is assumed in this example.

CC theory and modeling

The CC function requires a set of application services that can be used by various components on an as-needed basis. The use of these middleware services has no system-wide implications. Four key application services need to be available to effectively implement this functional block.

Again, a set of services mentioned in the SA Forum reference model can be employed here to meet the objectives. The Event Service (EVT) provides an application with a publish/subscribe mechanism that is based on the concept of event channels, where a publisher communicates asynchronously via events with one or more subscribers. An efficient messaging engine is foundational to any control plane functionality. The Message Service (MSG) bases its buffered message passing system on the concept of message queues, where applications write messages to and read messages from the same queue. A checkpoint capability is needed to record an application’s state information and replicate it to a standby resource, so that in the event of a failure the state information can be used to effect a seamless failover. The Checkpoint Service (CKPT) can be employed to achieve such functionality. The checkpoint information must be stored and retrieved in real time to achieve seamless recovery from failures, hence the need for an in-memory, high-performance data store such as Management Data Store (MDS).
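The checkpoint idea can be illustrated with a toy in-memory model. This is a minimal sketch and not the CKPT API: the class and method names are hypothetical, replication is a synchronous deep copy inside one process, and a real service would replicate across nodes.

```python
import copy
from typing import Any, Dict, Optional


class CheckpointReplica:
    """One in-memory copy of the checkpoint data, keyed by section name."""

    def __init__(self) -> None:
        self.sections: Dict[str, Any] = {}


class Checkpoint:
    """Toy checkpoint: each write is copied to the standby replica before returning."""

    def __init__(self) -> None:
        self.active = CheckpointReplica()
        self.standby = CheckpointReplica()

    def write(self, section: str, state: Any) -> None:
        self.active.sections[section] = state
        # Deep copy so later changes on the active side cannot leak into the replica.
        self.standby.sections[section] = copy.deepcopy(state)

    def fail_over(self) -> None:
        """Simulate loss of the active side: the standby replica takes over."""
        self.active, self.standby = self.standby, CheckpointReplica()

    def read(self, section: str) -> Optional[Any]:
        return self.active.sections.get(section)


if __name__ == "__main__":
    ckpt = Checkpoint()
    ckpt.write("call-42", {"state": "connected", "channel": 7})
    ckpt.fail_over()
    print(ckpt.read("call-42"))  # call state survives the failover
```

Because the write completes only after the standby copy exists, the new active side starts from the last recorded call state rather than from scratch.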

CC practice

Several design choices are necessary here. The communication framework needs to provide for stateful failover. The state replication mechanism must provide synchronization capabilities so that correct replicas are always used, regardless of how quickly or slowly the standby systems assume the active role. A synchronization problem between active and standby must not cause configuration or statistics management problems. The designer must also plan a proper distribution of redundancy models for the control plane applications. These design decisions and associated policy configurations manifest as an exercise in AMF system model configuration: taking the developed components and mapping them onto the logical and physical world of the overall system.

An N+M redundancy policy is implemented for this functional module.
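As a rough illustration of N+M behavior, the sketch below tracks N active service units and M spares and promotes a spare when an active unit fails. It models only the bookkeeping: the class and method names are hypothetical, and the state transfer that a real failover needs is left out.

```python
from typing import List, Optional


class NPlusMGroup:
    """Toy N+M group: N service units carry load, M spares wait to take over."""

    def __init__(self, active: List[str], spares: List[str]) -> None:
        self.active = list(active)
        self.spares = list(spares)

    def fail(self, su_name: str) -> Optional[str]:
        """Remove a failed SU and return the name of the promoted spare, if any."""
        if su_name in self.spares:
            self.spares.remove(su_name)  # a failed spare only reduces protection
            return None
        if su_name not in self.active:
            raise ValueError(f"unknown service unit: {su_name}")
        self.active.remove(su_name)
        if not self.spares:
            return None  # no spare left: the group runs with reduced capacity
        promoted = self.spares.pop(0)
        self.active.append(promoted)
        return promoted


if __name__ == "__main__":
    # Four SUs arranged as 3+1, an assumed split for illustration.
    group = NPlusMGroup(active=["cc-su-1", "cc-su-2", "cc-su-3"], spares=["cc-su-4"])
    print(group.fail("cc-su-2"), group.active)
    print(group.fail("cc-su-1"), group.active)
```

The second failure in the usage example finds no spare, which shows why the choice of M is a capacity and risk decision rather than an implementation detail.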

Media Control

Requirements: This functional block addresses the data/user plane functionality of the BSC, and must meet a set of key requirements:

• Manage the switching configuration for the voice media
• Obtain switching configuration information from the OAM&P functional module
• Obtain switching control requests from the call control functional module

 

In order to provide high availability for this function, a 2N redundancy model is assumed.

Media Control theory and modeling

Typically these functional requirements tend to be generic in nature and are most often met at the firmware level in the hardware platform. Thus none of the standard services, such as those prescribed by the SA Forum interface specifications, apply here. However, other related issues must be addressed. One important area is software upgradability: software must be designed and developed so that it can be upgraded in service without impacting service availability. A simple scenario calls for a rolling upgrade, in which the active SUs on an active node gracefully and seamlessly migrate to a standby node while the active node is brought down for upgrade. Once the upgrade is completed, the process can be repeated with the other node.
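The rolling upgrade sequence can be sketched for a two-node pair as follows. This is a simplified model under stated assumptions: the names are hypothetical, each switchover is treated as instantaneous, and the upgrade of the offline node always succeeds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    version: str
    role: str  # "active" or "standby"


def rolling_upgrade(pair: List[Node], new_version: str) -> List[str]:
    """Upgrade a two-node pair one node at a time and return the steps taken."""
    steps: List[str] = []
    for _ in range(2):
        old_active = next(n for n in pair if n.role == "active")
        new_active = next(n for n in pair if n.role == "standby")
        # Migrate the workload first, so service continues on the other node.
        old_active.role, new_active.role = "standby", "active"
        steps.append(f"switchover {old_active.name} -> {new_active.name}")
        # The node that gave up the workload can now be taken down and upgraded.
        old_active.version = new_version
        steps.append(f"upgrade {old_active.name} to {new_version}")
    return steps


if __name__ == "__main__":
    nodes = [Node("blade-a", "1.0", "active"), Node("blade-b", "1.0", "standby")]
    for step in rolling_upgrade(nodes, "2.0"):
        print(step)
    print(nodes)
```

At every step exactly one node holds the active role, which is the property that keeps the service available while both nodes are upgraded in turn.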

Media Control practice

Often legacy software components, or external firmware-based functionality such as Media Control, cannot easily be modified to integrate directly with the middleware. So what is one to do if such critical functionality needs to be maintained and used in a highly available system such as the BSC? Fortunately, a practice standardized in the SA Forum reference architecture can effectively manage high availability in such circumstances. A proxy component can be developed to mediate between the proxied component (the external firmware in this case) and AMF. AMF then has explicit control of the proxied component through callbacks, so a suitable redundancy policy can be implemented to ensure high availability of such functionality. This scheme has been successfully used to implement a 2N redundancy model for this functional component of the BSC.
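The proxy arrangement can be pictured with the sketch below. It is an illustration only: the firmware class stands in for an external device, and the callback method names are hypothetical rather than the AMF callback signatures.

```python
class FirmwareMediaSwitch:
    """Stand-in for external firmware that cannot link against the middleware."""

    def __init__(self) -> None:
        self.forwarding = False

    def enable(self) -> None:
        self.forwarding = True

    def disable(self) -> None:
        self.forwarding = False


class MediaControlProxy:
    """Proxy component: receives availability callbacks and drives the firmware."""

    def __init__(self, firmware: FirmwareMediaSwitch) -> None:
        self.firmware = firmware
        self.ha_state = "standby"

    def on_set_ha_state(self, ha_state: str) -> None:
        """Hypothetical callback invoked when the framework assigns an HA state."""
        if ha_state == "active":
            self.firmware.enable()
        elif ha_state == "standby":
            self.firmware.disable()
        else:
            raise ValueError(f"unsupported HA state: {ha_state}")
        self.ha_state = ha_state

    def on_healthcheck(self) -> bool:
        """Report healthy only if the firmware matches the assigned HA state."""
        return self.firmware.forwarding == (self.ha_state == "active")


if __name__ == "__main__":
    proxy = MediaControlProxy(FirmwareMediaSwitch())
    proxy.on_set_ha_state("active")
    print(proxy.firmware.forwarding, proxy.on_healthcheck())
```

The firmware itself is untouched; only the proxy understands the availability framework, which is what makes the approach suitable for legacy or external components.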

The Base Station Controller

Figure 3 shows how the three functional blocks (OAM&P, Call Control, and Media Control), which have been designed using state-of-the-art theory and modeling and developed using current practices, are pulled together in a highly available BSC. The OAM&P functionality has been made highly available through the use of standardized services and redundant SUs brought together in a service group implementing a 2N (1+1 in the limiting case) redundancy model. Similarly, the call control module uses standardized HA services in four SUs that form a service group implementing an N+M redundancy policy. Media Control uses a special mechanism that allows AMF to represent, and HA-enable, an external or legacy component such as functionality implemented in firmware: it is proxied by proxy components that are used in a 2N redundancy policy.

Figure 3: A deployment-ready base station controller

Final thoughts

As stakes for mission-critical applications rise, so does the demand for alternatives to costly, inflexible proprietary systems. Open specifications such as those developed by the SA Forum are propelling an ecosystem of commercially viable, high availability products. Implementations of these specifications provide services that can be used to create highly available network elements, such as a base station controller, that are designed from the ground up for uninterrupted service availability. Solutions such as GoAhead SAFfire availability and management middleware are proving to be a real and fast-growing approach to well-thought-out designs that meet carrier-grade requirements and are flexible enough to adapt to new technology, all in a manner that makes the best use of financial and human resources.

Dr. Asif Naseem is President and Chief Operating Officer of GoAhead Software, Inc. in Seattle, WA. He is also the President of the Service Availability Forum. He has more than 20 years of experience in the computer and communications industry. Asif is a frequent speaker at industry and academia events, and publishes regularly on computer and communication-related topics in various magazines and journals. He has an MS in Electrical Engineering and a PhD in Computer Engineering from Michigan State University.

Acknowledgements

The author wishes to thank Steve Mills, GoAhead senior systems engineer, and David Fick, GoAhead chief architect, for their consultation and expertise.


 

 

 


©MMX CompactPCI AdvancedTCA & MicroTCA Systems. An OpenSystems Media, LLC publication.
Last updated: 07/29/10 10:01 America/Phoenix