I still remember the first hard drive I ever saw.
It was the mid-80s and the hard drive had a capacity of 20MB, which seemed a ridiculous amount of space. No one could actually imagine filling it up.
This hard drive was roughly the size of a washing machine. Black, shiny stacks of platters were loaded from above during a demonstration. The proud owners presented it as a marvel and dramatically exclaimed, “The reading heads would fly so fast and so close to the magnetic disks that it was akin to flying 500 miles per hour in a Boeing 747, just three feet from the ground”. This sheer power at speed meant it was easy to picture a catastrophic failure. And indeed such failures were frequent, meaning the disaster recovery plan was simply getting data back from the tape onto a new disk as quickly as possible.
Today’s hard drives truly are a marvel. A recent article from the BBC touts a 3.5-inch drive with a 40TB capacity in the near future. A lot has changed since 1985!
Disks are now hot-swappable. They sit in redundant arrays - immune to the failure of any individual disk. The arrays themselves are often replicated in near real-time to distant data centres. But of course disaster recovery plans are still put in place to get data back, though actual tapes are now a rarity.
We all hope that the disasters we envisage will never strike. But disasters do occur. How much they might affect us depends largely on planning and preparedness.
Preparing for modern times
Sometimes “Recovery Plans” are designed around assumptions that are dangerously anchored in the past, in the era of that washing-machine-sized hard drive. When recovery procedures anticipate only disk failure, they fail to offer adequate defence against software corruption, bad updates, network and power outages, or deliberate, malicious attack.
When I think of that hard drive I saw back in 1985, the most predictable disaster was that it would simply crash. Today, things are much more complex - enterprises are more interconnected, both internally and externally. The dependence on these systems remaining online is far greater, as are the costs of being offline. And the potential causes of disaster are now far more diverse. Successfully recovering from a disaster is largely determined by whether or not that scenario was anticipated and planned for beforehand. Where a specific set of recovery processes is laid out and practiced ahead of time, the chances of success are far higher. Below are four key steps to include when developing modern IT disaster recovery plans:
1. Monitoring and alerting
The first step of disaster recovery planning is monitoring and alerting. Put simply - how will you actually know a disaster has occurred? A classical view of a disaster is a cacophony of telephones all ringing at once. Under this view, we could imagine noise levels rising as we look about to see monitors across the office frozen, perhaps showing hexadecimal codes tabulated across bright blue screens of death. No one could be left with any doubt that something was indeed very wrong in that scenario!
The presumption that you will immediately know something is wrong is neither valid nor safe. If something goes wrong after hours, for example, teams working in other time zones who need a response might only be able to leave a voicemail. Or if data corruption ruins your accounting balances, will this be detected before month-end financials are run? Alternatively, if someone is deliberately encrypting your systems with the intent of ransoming the passwords, could they also be evading detection?
Given the diverse nature of possible disasters with modern systems, monitoring and alerting processes need to be equally diverse. Disaster recovery plans need to be developed and implemented with fail-safes and actual, diverse disaster scenarios in mind. Would a server be guaranteed to be able to email you in the event of a problem, or will that capability be offline for the same reasons? If someone is changing security settings, will anyone else know?
Ideally, alerting is handled by a separate witness server that raises the alarm whenever an “all OK” message or status cannot be verified, rather than waiting for a failing system to report its own failure.
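One way to picture the witness idea is as a "silence is an alarm" check. The Python sketch below is only an illustration under stated assumptions: the class name WitnessMonitor, the system names and the 60-second limit are all hypothetical, and a real witness would run on separate infrastructure and verify status over the network.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class WitnessMonitor:
    """Hypothetical witness: tracks 'all OK' reports and flags systems that go quiet."""

    max_silence_seconds: float
    clock: Callable[[], float] = time.monotonic  # injectable so drills can fake time
    last_ok: Dict[str, float] = field(default_factory=dict)

    def register(self, system: str) -> None:
        # Start the silence timer at registration, so a system that
        # never reports at all is still flagged.
        self.last_ok[system] = self.clock()

    def record_ok(self, system: str) -> None:
        # Called each time an 'all OK' status has been verified for the system.
        self.last_ok[system] = self.clock()

    def overdue(self) -> List[str]:
        # Any system whose last verified 'all OK' is too old is treated as
        # a potential disaster, whatever the underlying cause.
        now = self.clock()
        return sorted(
            name
            for name, seen in self.last_ok.items()
            if now - seen > self.max_silence_seconds
        )
```

The design choice here is that the alarm depends on the absence of good news, so it does not rely on the failing server still being able to send an email.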
2. Assessment and diagnosis
Once a problem has been identified, assessment and diagnosis are required. This will govern the specific recovery steps that are appropriate to take. Recovery options could range from a full failover of all servers to a second data centre, to the replacement of one or more databases from backups, to the recovery of specific corrupt data pages.
A traditional viewpoint based only on a simple disk failure would point to a limited set of disaster recovery responses. Quickly and accurately identifying which hardware, software, network or data is compromised determines which specific recovery steps are appropriate, and ultimately leads to the best outcome.
3. Securing the site
Recovering a set of business systems to an earlier point in time often entails overwriting the database. It is definitely best to avoid overwriting where you can. Rather than simply replacing a damaged or compromised database using a backup, you should be able to move the damaged database files aside first. If that isn’t possible, you should at the least be able to complete a final differential or log backup before overwriting.
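As a rough illustration of moving damaged files aside, the Python sketch below relocates a set of files into a timestamped quarantine folder before any restore takes place. The function name quarantine_files and the folder naming are hypothetical, and the sketch assumes the database engine has already been stopped or has released the files; it is not a substitute for your database platform's own procedures.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def quarantine_files(damaged: List[Path], quarantine_root: Path) -> Path:
    """Move damaged database files into a new timestamped folder and return it."""
    # A UTC timestamp keeps successive incidents apart and records when
    # the files were set aside.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = quarantine_root / f"damaged-{stamp}"
    # exist_ok=False: refuse to mix files from two incidents in one folder.
    target.mkdir(parents=True, exist_ok=False)
    for path in damaged:
        # Moving rather than deleting preserves the evidence for later
        # diagnosis or partial data recovery.
        shutil.move(str(path), str(target / path.name))
    return target
```

Keeping the damaged copy means that anything missing from the backup can still be investigated once the business is back online.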
4. Planning and practicing recovery
Getting last night’s tapes back into the server and then running the “Restore” program was once the only recovery option available.
In today’s work environment there are far more possible responses that address a far wider array of possible disasters. How fast and successful a recovery can be depends on how well it has been planned for, and how recently that plan was practiced. Practicing real-world recovery scenarios regularly will also give a good indication of how long the overall recovery process might take to complete.
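To show how a practice run can yield a time estimate, here is a minimal Python sketch that times each step of a recovery drill. The function name run_drill and the step names in the usage example are hypothetical placeholders; real steps would call whatever restore and verification procedures your plan defines.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_drill(steps: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run each named drill step in order and return its duration in seconds."""
    timings: Dict[str, float] = {}
    for name, action in steps:
        started = time.perf_counter()
        action()  # an exception here stops the drill, which is itself a finding
        timings[name] = time.perf_counter() - started
    return timings


if __name__ == "__main__":
    # Placeholder steps that only sleep; substitute the real procedures.
    results = run_drill([
        ("restore backup", lambda: time.sleep(0.2)),
        ("verify data", lambda: time.sleep(0.1)),
    ])
    for step, seconds in results.items():
        print(f"{step}: {seconds:.2f}s")
    print(f"total: {sum(results.values()):.2f}s")
```

Comparing these figures between drills gives an early warning when recovery is starting to take longer than the business can tolerate.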
Planning for success
Working with business teams to plan and practice recovery from real-world scenarios is key to planning for a range of disasters that may occur. It’s less about being cynical and more about being realistic.
- Knowing about issues in a timely manner depends on good monitoring and alerting
- Correctly responding to different types of problems requires good planning, assessment and diagnostic capability
- Knowing how to secure a damaged system before replacing it will govern what can be fully recovered once a business is back online
- Practicing recovery scenarios ensures a timely and successful response to disaster
The days of the washing-machine-sized hard drive are long gone, and it’s important that our disaster recovery plans keep pace with modern threats to give us the best protection.