What is High Availability?

Published: 29 July 2024
Contributors: Mesh Flinders, Ian Smalley

What is high availability?

High availability (HA) is a term that refers to a system’s ability to be accessible and reliable close to 100% of the time.

Highly available systems must be able to withstand outages, including scheduled downtime and site-wide disasters. Typically, HA systems meet two characteristics:

They must be available for use close to 100% of the time.
They must be able to meet a certain set of predetermined user expectations.

With the growth of digital transformation initiatives and the subsequent move of many services to the cloud, high availability solutions are now offered by many tech and software as a service (SaaS) companies, including Microsoft, Amazon (AWS), IBM®, Red Hat® and more.

High availability of IT systems is particularly important in industries where critical applications rely on having little or no system downtime. For example, in hospitals and data centers, users depend on high availability solutions to perform many routine, daily functions. If users can’t access a system for any reason, it is deemed ‘unavailable.’ The period of time that a system is unavailable to users is known as downtime.

High availability (HA) versus Disaster Recovery (DR)

Disaster recovery (DR) consists of IT infrastructure technologies and best practices designed to prevent or minimize data loss and business continuity disruption resulting from catastrophic events. High availability (HA), on the other hand, typically concerns smaller failures or faults that might impact a system’s availability.

Even though they are different, DR and HA both share the goal of minimizing disruption to IT systems, and both typically employ redundant components and redundant systems as part of an overall strategy. Also, both DR and HA use data backups to make data available in case of a wide range of problems, including hardware failures, software failures and power outages.

High availability versus fault tolerance

Fault tolerance is a system’s ability to operate continuously after one or more of its critical components fail. Like HA, fault tolerance can help make a system available during or after a disruptive event.

However, where fault tolerance and HA differ is in the way they treat downtime. While HA seeks to have as little downtime as possible, the goal of fault tolerance is zero downtime, a goal it can only achieve through redundancy, having a backup or secondary copy of every single component in the infrastructure.

Benefits of high availability

With enterprises relying more than ever on online services and cloud and hybrid cloud architectures to deliver critical applications and services, infrastructure demands are rising, making high availability a priority. Here are some of the most common enterprise benefits of highly available systems.

Increased flexibility

With digital transformation a key objective for most companies, high availability of systems is critical to giving employees and customers unlimited access to critical applications¹.

Secure data

With high availability architecture, organizations’ most important data remains available, accessible and protected from unauthorized access.

Improved brand reputation

System failures that cause hours or even minutes of downtime can cause public relations nightmares for enterprises across a broad range of industries, including SaaS, aviation and mobile technology². High availability infrastructure helps ensure that a brand’s reputation doesn’t suffer due to an outage or unexpected downtime.

Better customer service

Managed Service Providers (MSPs) must deliver high availability of networks or risk not fulfilling their service level agreements (SLAs). HA systems help MSPs deliver networks that their most valuable clients can depend on, like one that helps autonomous vehicles drive safely or a healthcare facility manage patient records.

How to achieve high availability

Whether seeking to achieve zero downtime in an industry like healthcare or finance, or simply looking for ways to avoid reputational damage from outages, businesses looking for high availability typically follow a four-step process.

Remove single points of failure: Single points of failure are components that would cause an entire system to stop functioning if they were to fail. For example, if an array of servers is connected through a single network switch and that switch fails, every server behind it becomes unreachable. A tactic called load balancing, where work is distributed across a system’s resources, is often used to reduce and even eliminate single points of failure.
Create reliable failover: Failover is the transfer of workloads from a primary system to a secondary system in the event of a failure on the primary system. When businesses create reliable failover, workloads can be easily transferred without significant downtime, data loss or drop in operational performance.
Detect failures instantly: High availability depends on having processes in place to detect failures or faults in a system the instant they occur. Many modern systems have built-in, automated failure detection. Some can even detect a failure and choose the next course of action, like initiating a failover process.
Build strong data backup and restore capabilities: When individual parts of a system fail, data can be lost if the proper backup and restore procedures aren’t in place. Data protection technologies and practices make periodic copies of data and applications to a separate, secondary device so data and applications can be quickly recovered.
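
The failure detection and failover steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pattern: the function names (`call_with_failover`, `primary`, `secondary`) are hypothetical, and it assumes that a failed system signals the failure by raising `ConnectionError`.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(backends: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each backend in priority order; return the first successful response."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend(request)
        except ConnectionError as exc:
            # Failure detected: remember it and fail over to the next backend.
            last_error = exc
    raise RuntimeError("all backends failed") from last_error


def primary(request: str) -> str:
    # Hypothetical primary system that is currently down.
    raise ConnectionError("primary is down")


def secondary(request: str) -> str:
    # Hypothetical secondary system that takes over the workload.
    return f"secondary handled {request}"


result = call_with_failover([primary, secondary], "GET /records")
print(result)
```

Here the request is served by the secondary system because the primary raises an error. Real failover usually also involves health checks, timeouts and data replication, which this sketch leaves out.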

Load balancing

Many HA systems use load balancing, the process of distributing traffic among multiple servers to optimize application availability. For example, a high-traffic website or cloud service might receive millions of user requests every day. Load balancing helps applications deliver content from web servers to users promptly and without interruption. Load balancing, especially with multiple load balancers working together, helps ensure that no single component in a system is overwhelmed and becomes a single point of failure that might cause downtime or an outage.
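
As a rough illustration, the sketch below shows one simple load balancing strategy, round robin, that also skips servers marked as unhealthy. The class name `RoundRobinBalancer` and the server names are assumptions made for this example; production load balancers use more sophisticated algorithms and health checks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # One full pass over the ring is enough to find a healthy server.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
before = [balancer.next_server() for _ in range(3)]

balancer.mark_down("web-2")  # simulate a failed server
after = [balancer.next_server() for _ in range(4)]

print(before)
print(after)
```

After `web-2` is marked down, requests continue to rotate between the two remaining servers, so the failure of one server does not interrupt service.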

Redundancy

Redundancy—having a secondary or backup component available to take over when a primary one fails—is an important part of a high availability system. Redundancy enables databases to remain available to users and applications even when a component isn’t functioning. If a component in a system is not redundant, that component would be considered a single point of failure, as losing it might potentially stop the whole system from working.
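
The effect of redundancy can be estimated with a short calculation. The sketch below assumes that redundant copies fail independently of one another and that the system stays up as long as at least one copy works. Real components often share failure causes such as power or networking, so treat the result as an upper bound. The function name `redundant_availability` is illustrative.

```python
def redundant_availability(component_availability: float, copies: int) -> float:
    """Availability of a group of identical, independent redundant components.

    The group is unavailable only when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - component_availability) ** copies


single = redundant_availability(0.99, 1)  # one component, a single point of failure
paired = redundant_availability(0.99, 2)  # the same component with one backup

print(f"single: {single:.4%}")
print(f"paired: {paired:.4%}")
```

Under these assumptions, adding one backup to a component that is available 99% of the time raises the availability of the pair to about 99.99%.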

High availability clusters

High availability clusters, also known as high availability clustering, are groups of connected machines that work together as a single system. When one machine in a cluster fails, cluster management software transfers its workload to another machine. Within a high availability cluster, storage shared among the nodes (computers) helps prevent data loss if a single node stops functioning.
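
The following sketch shows, in simplified form, how cluster management software might detect a failed node and move its workloads. It assumes that nodes report heartbeats and that a node is considered failed once its last heartbeat is older than a timeout. The `Cluster` class and all names in it are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class Cluster:
    """Toy cluster manager: tracks heartbeats and reassigns workloads."""

    def __init__(self, nodes, heartbeat_timeout):
        self.timeout = heartbeat_timeout
        self.last_seen = {node: 0.0 for node in nodes}
        self.workloads = {node: [] for node in nodes}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def assign(self, node, workload):
        self.workloads[node].append(workload)

    def live_nodes(self, now):
        return [n for n, seen in self.last_seen.items() if now - seen <= self.timeout]

    def rebalance(self, now):
        """Move workloads off nodes whose heartbeat has timed out."""
        live = self.live_nodes(now)
        if not live:
            raise RuntimeError("no live nodes left in the cluster")
        moved = []
        for node in self.workloads:
            if node in live:
                continue
            while self.workloads[node]:
                workload = self.workloads[node].pop(0)
                # Send the workload to the least-loaded live node.
                target = min(live, key=lambda n: len(self.workloads[n]))
                self.workloads[target].append(workload)
                moved.append((workload, node, target))
        return moved


cluster = Cluster(["node-a", "node-b", "node-c"], heartbeat_timeout=5.0)
cluster.assign("node-a", "db")
cluster.assign("node-a", "cache")
cluster.assign("node-b", "web")

# node-a stops sending heartbeats; the other two nodes report in at t=10.
cluster.heartbeat("node-b", now=10.0)
cluster.heartbeat("node-c", now=10.0)

moved = cluster.rebalance(now=12.0)
print(moved)
```

In this run, both workloads from the silent node are moved to the two nodes that are still reporting heartbeats. The sketch does not model shared storage, which is what allows the moved workloads to pick up their data on the new node.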

How do you measure high availability?

High availability is measured in relation to a system being 100% operational, or never having a single outage. While no system can be 100% operational, setting that as a goal helps in measuring how available a system is over a period. The most common metric for high-availability systems and services is something called five nines availability.

Five nines availability

Five nines availability means that a system can run and perform 99.999% of the time. Typically, only systems in highly critical industries, such as healthcare, transportation, finance or government, require five nines availability. These systems are important to people’s lives, access to food and shelter and economic well-being.

Systems that don’t operate in these highly critical industries typically don’t require as much operational availability and can make do with ‘three or four nines’ (99.9% or 99.99%) availability. Another way this is frequently described is to say a highly available system has “99.9/99.999% uptime.”
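
To make these percentages concrete, the sketch below converts an availability target into the downtime it allows per year. It assumes a 365-day year, and the function name `allowed_downtime_minutes` is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Maximum downtime per year, in minutes, for a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows {minutes:.2f} minutes of downtime per year")
```

Three nines allows roughly 525.6 minutes (about 8.8 hours) of downtime per year, four nines about 52.6 minutes and five nines only about 5.3 minutes.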

Other important metrics: Mean and recovery times

In addition to five nines availability, IT systems managers use several other key metrics to measure how available their systems are:

Mean time between failure (MTBF): Mean time between failure (MTBF) is a measure of the reliability of a system or component. It’s a crucial element of maintenance management, representing the average time that a system or component will operate before it fails. The MTBF formula is often used in the context of industrial or electronic system maintainability, where failure of a component can lead to significant downtime or even safety risks, but MTBF is used across many types of repairable systems and diverse industries.
Mean time to repair (MTTR): Mean time to repair (MTTR), sometimes referred to as mean time to recovery, is a metric that is used to measure the average time it takes to repair a system or piece of equipment after it has failed. MTTR includes the time from when the failure occurs to when the system or equipment is fully functional again. This includes the time it takes to detect the failure, diagnose the issue and fix the problem. MTTR is an important metric to monitor because it evaluates the availability and reliability of systems and equipment.
Recovery time objective (RTO): Recovery time objective (RTO) is the maximum acceptable length of time to recover from an outage (scheduled, unscheduled or disaster) and resume normal operations for a system, application or set of applications. The RTO might be different for scheduled, unscheduled and disaster-recovery outages.
Recovery point objective (RPO): Recovery point objective (RPO) is the point in time relative to the failure to which you need preservation of data. Data changes preceding the failure or disaster by at least this time period are preserved by recovery processing. Zero is a valid value and is equivalent to a "zero data loss" requirement.
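
These metrics can be combined in simple calculations. The sketch below uses the commonly cited steady-state relationship, availability = MTBF / (MTBF + MTTR), and a basic check of an incident against RTO and RPO targets. The function names and sample figures are assumptions chosen for illustration.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a repairable system is up, given its MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def meets_objectives(outage_minutes: float, data_loss_minutes: float,
                     rto_minutes: float, rpo_minutes: float) -> bool:
    """True if an incident stayed within both the RTO and the RPO."""
    return outage_minutes <= rto_minutes and data_loss_minutes <= rpo_minutes


# A system that runs 999 hours between failures and takes 1 hour to repair.
availability = steady_state_availability(mtbf_hours=999, mttr_hours=1)
print(f"availability: {availability:.3%}")

# An incident with a 20-minute outage that lost the last 10 minutes of data,
# measured against an RTO of 30 minutes and an RPO of 5 minutes.
within_targets = meets_objectives(outage_minutes=20, data_loss_minutes=10,
                                  rto_minutes=30, rpo_minutes=5)
print(f"within targets: {within_targets}")
```

In this example the system is available about 99.9% of the time, and the incident meets the RTO but misses the RPO, because more data was lost than the objective allows.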

High availability examples

As organizations across many industries undertake broad digital transformation initiatives, the availability demands on their infrastructure are increasing. Remote work and the spread of 5G networks have made it normal for users to expect to be able to access data and applications from anywhere at any time. That is possible only if the underlying systems powering the applications and regulating access to the data are available. Here are some examples of highly available systems that help modern enterprises thrive:

Electronic health records (EHR)

Gone are the days when a doctor flipped through files in a cabinet to find the date of your last vaccination. Today, if you show up at the emergency room or a specialist’s office, it’s almost certain your doctor will access your records online. Because of the critical and private nature of this kind of information, EHR systems are an example of a highly available system that can securely deliver accurate information within seconds with close to zero downtime.

Autonomous vehicles

Driverless, or autonomous, vehicles, such as cars, drones and others, rely on fast, powerful internet connections so the artificial intelligence (AI) that controls them can function. When an autonomous vehicle pulls up to a stop light, for example, tens of thousands of pieces of data are being processed in near real time so that it stops at the light where it is supposed to and proceeds on to its destination. High availability is crucial to the safe operation of autonomous vehicles of all kinds.

Internet of Things (IoT)

The Internet of Things (IoT) is a network of physical devices, vehicles, appliances and other objects that are embedded with sensors connected to the internet that allow them to collect and share data. As the IoT ecosystem expands into roads, waterways, home appliances, weather monitoring and more, millions and millions of devices are relying on networks. High availability helps ensure that networks supporting IoT devices run smoothly and without interruption.

Big data

As enterprises find more ways to use the massive amounts of data they generate in the digital age, high availability is essential to efficient, effective data processing. Data centers and complex analytics platforms perform continuous data processing and real-time analysis, and downtime can set back projects by months. HA solutions help enterprises have 24/7/365 access to their most important data.

Footnotes