When was the last time you tested your backup plan?
Most critical systems now have a redundant solution, and sometimes even a third option, should a failure occur. But just because they have been designed and installed does not mean they will be there when you need them. Designing a system with backup and redundancy gives us a nice warm feeling that everything will be OK if something goes wrong. However, as with many things in life, a system that is never used can fail without you even knowing, letting you down when it is needed most.
So how can you actually prove your system is trustworthy?
Schedule a failure test. This could be a small, site-specific failure or even a large-scale data centre failure. It will quickly highlight what is and isn’t working, in a controlled manner.
The behaviour of the hardware, whether physical or virtual, can be reviewed and analysed during the outage. Any issues can be rectified, and any failed hardware can be repaired or replaced.
One example is a satellite dish whose line of sight has become blocked over time by tree growth. The system has been designed so that satellite communication automatically switches on and transmits or receives data in the event of a cellular failure, but in this instance the communication could fail or prove to be unreliable and intermittent. A failure test would flag the site as a concern, and a site visit could be arranged to analyse the issue. The trees can then be cut back and the issue resolved in a short space of time at little cost.
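A drill like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_failover_drill`, `make_obstructed_link` and the 90% pass mark are hypothetical names and values, and the "link" is a stand-in function rather than real satellite hardware.

```python
from typing import Callable


def run_failover_drill(
    backup_send: Callable[[bytes], bool],
    attempts: int = 10,
    min_success_rate: float = 0.9,  # assumed pass mark, not a standard
) -> dict:
    """With the primary link deliberately disabled, probe the backup link."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    # Count how many probe messages the backup link reports as delivered.
    delivered = sum(1 for _ in range(attempts) if backup_send(b"drill-probe"))
    success_rate = delivered / attempts
    return {
        "delivered": delivered,
        "attempts": attempts,
        "success_rate": success_rate,
        # Flag the site for a visit if the backup is unreliable.
        "needs_site_visit": success_rate < min_success_rate,
    }


def make_obstructed_link() -> Callable[[bytes], bool]:
    """Stand-in for an intermittent backup link: every other probe is lost."""
    calls = {"count": 0}

    def send(payload: bytes) -> bool:
        calls["count"] += 1
        return calls["count"] % 2 == 0

    return send


report = run_failover_drill(make_obstructed_link(), attempts=10)
print(report)
```

In a real drill the stand-in would be replaced by whatever actually sends a message over the backup link, and the report would feed into the decision to arrange a site visit.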
Another example could be a firmware upgrade that has disabled or changed the configuration of the redundant system. Unfortunately, firmware upgrades can affect features or settings that did not need to be updated, and the change may not come to light until the system is tested in anger. These subtle configuration changes can be hard to find, but once a test has exposed them they can be addressed in slow time and corrected as required, preventing catastrophic problems in the future.
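One way to catch this kind of change early is to compare a snapshot of the settings taken before the upgrade with one taken after it. The sketch below assumes the configuration can be exported as simple key/value pairs; the function name `config_drift` and the setting names are hypothetical and only there for illustration.

```python
MISSING = "<missing>"  # marker for a setting absent from one snapshot


def config_drift(baseline: dict, current: dict) -> dict:
    """Return every setting whose value differs between two snapshots."""
    drift = {}
    # Look at settings present in either snapshot, in a stable order.
    for key in sorted(baseline.keys() | current.keys()):
        before = baseline.get(key, MISSING)
        after = current.get(key, MISSING)
        if before != after:
            drift[key] = (before, after)
    return drift


# Hypothetical snapshots taken before and after a firmware upgrade.
before_upgrade = {
    "failover_enabled": True,
    "backup_link": "satellite",
    "retry_seconds": 30,
}
after_upgrade = {
    "failover_enabled": False,  # silently switched off by the upgrade
    "backup_link": "satellite",
    "retry_seconds": 30,
    "new_flag": 1,  # setting introduced by the upgrade
}

changes = config_drift(before_upgrade, after_upgrade)
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

A check like this does not replace the failure test itself, but it narrows down where to look when the backup does not behave as designed.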
A failure test exercises not only the backup infrastructure required in such a scenario, but also the personnel and processes in place. It is during time spent replicating such scenarios that you gain a full understanding of what, who and even where can have the biggest impact.
Changes in hardware, software and services could also mean that what was once ‘good enough’ for a secondary system may evolve into a more reliable primary solution. When a system is installed and not used, we may miss some of the functionality and capabilities it provides. Testing it brings it to the forefront of our minds and may change our view, enabling a more suitable or innovative solution to shine through.
Once a run-through has been completed, there is no reason why the same scenario cannot be repeated on a regular basis. It can become part of a quarterly, six-monthly or annual process, so that the system remains regularly maintained and monitored. We check our life-saving PPE, so why not check those critical systems regularly as well?
Dust off those servers, turn on the coffee and prepare for some out-of-hours work. It’s about time those oh-so-relied-upon primary solutions got a break!