The danger of single points of IT failure

Posted on March 27, 2018 by Louise Howland

Put simply, a single point of failure is a part of a system which, if it fails, stops the entire system from working. This can be extremely damaging to an organisation, as the revenue lost to downtime is significant for any business, so single points of failure should be identified and, where possible, eliminated. In IT, your system is to an extent only as strong as its weakest link; however, not all links are created equal. Some single points of failure may be acceptable because they are quick to resolve or prohibitively expensive to avoid.

Organisations often need to apply the 80/20 rule: you can spend 20% of the budget to address 80% of the single points of failure (for example, servers with redundant power supplies and RAID disks, or multiple hosts with shared storage), but the last 20% will cost 80% of the budget to resolve (redundant connectivity coming into the building via different routes, generator backup and so on). At ramsac we usually aim to deal with the major single points of failure that are relatively low cost to address, then look at other business continuity options to deal with the rest. So rather than installing generator backup power, most organisations will have systems replicated to another site or to the cloud, which can be used in the event of a power failure. That addresses far more potential issues and costs a fraction of installing and maintaining a generator system.

Don’t overlook the obvious

Sometimes the single point of failure is right in front of your nose, and for that reason it may be overlooked. A few years ago there was an organisation that had generator backup for power, which they diligently tested on a regular basis. It worked perfectly until they actually had a full power failure and found they couldn’t start the generator, because the starter motor was powered from the mains supply. Another example of overlooking the obvious was an organisation whose servers were highly resilient, but which kept having failures in part of the network once a week. In the end it turned out that a cleaner was unplugging a switch to plug in their vacuum, taking a big part of the network down each time. It just shows that you need to think about the system as a whole, not just focus on the big servers in the middle.

Examples of single points of failure

People – For some organisations it is not the hardware or software that is the single point of failure, but a person. Often one or two people are responsible for several systems, and these systems usually require specialist knowledge to operate, troubleshoot and recover. If the person responsible is on holiday, off sick or leaves the company, it may leave a knowledge gap which could be devastating to a business if something goes wrong. Understanding which employees are potential single points of failure, and putting processes in place so that documentation and training share their knowledge, can help organisations overcome this issue.
Hardware – Hardware is probably the most obvious single point of failure and usually the most critical. If a server goes down and you don’t have any backup systems, it can bring everything in your organisation to a standstill. Or if a router fails and users lose internet connectivity, it can be a major problem if you work in the cloud. Having another router available for redundancy, even if it can only be used for a few critical tasks, can keep a business operating in the short term.
Services/Providers – If one of your suppliers has a problem or outage at their end, it can directly impact your organisation and become a single point of failure, especially if they house your offsite data or provide your internet or voice services. By having a back-up plan in place to deal with issues outside of your control, you can limit the damage they do to your organisation.

Your single points of failure

When determining the single points of failure in your organisation, it is important to ask yourself:

What happens if this system fails?
What happens if any service dependency I have fails?
What happens if this person is off ill?
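
One way to make these questions concrete is to write down what each system depends on and then follow the chain when something fails. The sketch below is only an illustration: the inventory in `DEPENDS_ON` and the `knock_on` function are hypothetical names invented for this example, and it assumes every dependency is a hard one with no redundancy behind it.

```python
from collections import deque

# Hypothetical inventory: each item maps to the things it directly depends on.
# People can be listed alongside hardware and services.
DEPENDS_ON = {
    "email": ["internet"],
    "cloud_files": ["internet"],
    "internet": ["router"],
    "router": ["mains_power"],
    "core_switch": ["mains_power"],
    "file_server": ["core_switch", "mains_power"],
    "finance_system": ["file_server", "finance_admin"],
}


def knock_on(failed, depends_on):
    """Return everything that stops working, directly or indirectly, if `failed` fails."""
    # Invert the map so we can look up "what relies on this item?".
    dependents = {}
    for item, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(item)

    # Breadth-first walk outwards from the failed item.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for item in dependents.get(current, ()):
            if item not in affected:
                affected.add(item)
                queue.append(item)
    return affected


if __name__ == "__main__":
    for candidate in ("router", "finance_admin", "mains_power"):
        print(candidate, "->", sorted(knock_on(candidate, DEPENDS_ON)))
```

Items with a long list of knock-on effects are the ones to look at first. A real audit would also need to record where a backup already exists, which this simple map does not capture.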

Spotting and removing single points of failure

The best place to start is to carry out an audit, reviewing all elements of your IT infrastructure, including your ISP, email provider, software, other external IT services, servers, storage devices, laptops, computers, telecoms systems and people. (Basically, anything that is connected to your network.) It is important to be thorough: record the types of equipment, their age and the support contracts you have in place. The same document should also show the links between the parts of your IT infrastructure, so you can see the knock-on effect on the rest of the system if you lose internet connectivity or a server crashes. List your single points of failure, then create a matrix showing the ease and cost of fixing each one against the effect of it failing. This helps you prioritise which ones you can tolerate or address using business continuity, and which must be fixed. As the saying often attributed to Benjamin Franklin goes, by failing to prepare, you are preparing to fail. At ramsac we can help you identify your single points of failure as part of the free IT health check we offer to organisations to make sure your technology and information assets are working for, not against, you. Learn more in our free guide and contact us for more information.
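
The prioritisation matrix can be kept in a spreadsheet, but a small script shows the idea. This is a minimal sketch under stated assumptions: the 1 to 5 scoring scales, the thresholds and the names `SinglePoint`, `recommend` and `prioritise` are all made up for illustration, and the example entries are not taken from a real audit.

```python
from dataclasses import dataclass


@dataclass
class SinglePoint:
    name: str
    impact: int    # assumed scale: 1 (minor nuisance) to 5 (business stops)
    fix_cost: int  # assumed scale: 1 (cheap and easy) to 5 (prohibitively expensive)


def recommend(point):
    """Place a single point of failure in one quadrant of the matrix.

    The thresholds (impact >= 4, fix_cost <= 2) are illustrative assumptions.
    """
    if point.impact >= 4 and point.fix_cost <= 2:
        return "fix now"
    if point.impact >= 4:
        return "cover with business continuity"
    if point.fix_cost <= 2:
        return "fix when convenient"
    return "tolerate"


def prioritise(points):
    """Order by highest impact first, then by cheapest fix."""
    return sorted(points, key=lambda p: (-p.impact, p.fix_cost))


if __name__ == "__main__":
    # Hypothetical audit findings.
    findings = [
        SinglePoint("office printer", impact=1, fix_cost=3),
        SinglePoint("mains power to building", impact=5, fix_cost=5),
        SinglePoint("single server power supply", impact=5, fix_cost=1),
    ]
    for point in prioritise(findings):
        print(f"{point.name}: {recommend(point)}")
```

High-impact items that are cheap to fix come out on top, while high-impact items that are expensive to fix are pushed towards business continuity, such as replication to another site or the cloud.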