This article explains how to use RTO and RPO together with your Business Continuity Planning (BCP).
What would you do if you had to restore a failed system and all its data? Theory is not always easy to put into practice if you don’t have an identical lab in which to test your Disaster Recovery (DR) strategy.
One way to make an informed decision when designing your DR plan is to use the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) metrics. Fold this work into your BCP, and you have all bases covered.
If you want to know whether you can recover a system after a failure that has caused data loss, you will need to make some basic calculations.
The industry standard for this uses the metrics RTO and RPO. By combining what you already know about your systems with what you want to achieve, you will have a clear idea of what to expect following a disaster.
Two of the most important metrics every organisation and IT Director must understand are RTO (Recovery Time Objective) and RPO (Recovery Point Objective). Together, these allow organisations to design and implement a robust disaster recovery strategy, with a logical backup process that lets them restore every failed IT system within their targeted downtime.
On their own, RTO and RPO values aren’t much use; they must be used together. If you are responsible for planning and implementing a disaster recovery plan, these two metrics will help you get there.
A DR plan using RTO and RPO can and should be as simple as possible. It might cover the laptop a sole proprietor runs their business from, or an ISP with 1,000 servers in a data centre. Both are businesses which are severely impacted when business continuity fails.
Each should examine their own RTO and RPO and check if they are achievable.
Used together, these two metrics specify how long a system can be offline, and how much data can be lost from it, before your business processes suffer significant harm.
When you agree on your metrics, you can then benchmark your systems and processes to see if they comply with what you need.
Both will help you calculate how often your backups should run and what your acceptable recovery time is. From there, you will know how your data recovery process should work.
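To make the relationship concrete, here is a minimal sketch in Python, assuming an illustrative one-hour RPO and RTO (the function name and figures are ours, not an industry standard):

```python
# Minimal sketch: an RPO of 60 minutes means at most 60 minutes of
# data can be lost, so backups must run at least that often.

def max_backup_interval_minutes(rpo_minutes: int) -> int:
    """The backup interval can never exceed the RPO."""
    return rpo_minutes

rpo = 60  # minutes of data loss the business will accept
rto = 60  # minutes the system may be offline

print(f"Backups must run at least every {max_backup_interval_minutes(rpo)} minutes")
print(f"Any restore, including rebuild time, must finish within {rto} minutes")
```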
RTO is the time it takes to restore a previously working system, and it must fall within a previously agreed Service Level Agreement (SLA). Returning a failed system to normality after an outage is often referred to as Business Continuity.
A call centre (let’s call it Buddy Care) with 1,000 operators uses a customer management system to handle all customer enquiries. The system runs from a cluster of Linux servers and the company can’t deal with any customer queries or orders whilst it is offline.
Permitted System Downtime
Senior management has estimated the company loses £10,000 in sales and £5,000 in customer churn during the first hour it can’t service its customers’ requirements.
The churn cost doubles every hour as those customers become more frustrated, and both figures worsen if the system is offline during a promotion or at peak times.
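As a rough sketch of how quickly those costs compound (assuming the £10,000 hourly sales loss stays flat, which the figures above don’t state explicitly):

```python
# Hypothetical model of Buddy Care's downtime costs: £10,000/hour in
# lost sales, plus churn starting at £5,000 and doubling every hour.

def downtime_cost(hours_offline: int) -> int:
    sales_loss = 10_000 * hours_offline
    churn_loss = sum(5_000 * 2 ** h for h in range(hours_offline))
    return sales_loss + churn_loss

for hours in range(1, 5):
    print(f"{hours} hour(s) offline: £{downtime_cost(hours):,}")
# 1 hour(s) offline: £15,000 ... 4 hour(s) offline: £115,000
```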
From these figures, the company decides its SLA to recover from any system outage is 1 hour.
With the current setup, a one-hour SLA is probably unachievable, so the system will need improving by adding clustering, replication or live failover.
We have covered these services at the bottom of this article.
Now that they know their existing system cannot be recovered within one hour of a disaster, they have started their DR planning. Simply by looking at RTO, they have a business case and budget to improve their systems so they can meet their one-hour RTO.
RTO: one hour
At Buddy Care, they decide they can lose a week of customer service calls but can’t lose any order details, because these contain legal agreements and customer commitments.
The RPO statement says “we need all data recovered”. That is a perfectly normal response; however, the business continuity SLA (RTO) of one hour dictates the RPO. If they need 10 hours to recover all the data, that cannot be done within a one-hour RTO.
Permitted Data Loss
In an environment like Buddy Care’s, there will always be some data lost after a restore: transactions happen every few seconds, but the backup only runs once an hour. It is commonplace for the live backup sets to contain the most recent data, while historical data sits in another, infrequently accessed data set. After the live data has been restored, historical data (six months or more) is expected to take longer to return. The durability of that data isn’t in question; it is simply the persistence of the service which serves it. That is the difference between the persistence of a service (is it available now?) and the durability of data (can we restore all of it?).
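One way to make that split explicit is a restore-priority list. A minimal sketch, with tiers and timings that are purely illustrative:

```python
# Illustrative restore tiers (the data sets and timings are assumptions).
# Tier 1 must be back within the RTO; later tiers satisfy durability
# (can we restore all the data?) rather than availability.

restore_tiers = [
    {"tier": 1, "data": "live orders and open cases", "target": "within RTO (1 hour)"},
    {"tier": 2, "data": "last 6 months of history", "target": "within 24 hours"},
    {"tier": 3, "data": "archives older than 6 months", "target": "within 1 week"},
]

for t in restore_tiers:
    print(f"Tier {t['tier']}: {t['data']} -> {t['target']}")
```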
RPO is a very significant figure, and it should be impressed on senior management that it is a way of mitigating and managing data loss after the event. They should be aware of what type of data won’t be restored initially and whether it will be restored at a later date.
Let’s not get too technical, though. Even though we are talking tech and maths here, the underlying conundrum is a business one, and it belongs to senior management.
RPO calculation (which data can be lost)
One week of customer service calls.
No order details can be lost.
RPO: one week of customer service calls. No order details can be lost.
Because this is a critical business decision, the RTO and RPO metrics should be decided by senior management.
The IT Team will be required to confirm what is possible with the existing systems and recommend what is required to achieve the RTO and RPO.
The resulting DR plan must be thorough and regularly tested.
If a backup takes two hours, don’t assume the restore will take two hours. During a restore, you might need to procure and replace hardware, or rebuild a system from scratch, before the restore can even start. These processes can take days if you haven’t envisaged what is required or don’t have access to replacement hardware.
What if your server goes offline at 06:00 on a Sunday? Can you buy another one before Monday? And what about reinstalling the operating system before you can restore your databases and applications?
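A back-of-envelope check makes the point. Every timing below is an assumption for illustration; substitute the figures from your own restore tests:

```python
# Hypothetical end-to-end restore timeline, in hours.
restore_steps_hours = {
    "procure replacement hardware": 24,  # weekend delivery is unlikely
    "rebuild operating system": 3,
    "restore databases": 2,
    "restore applications": 1,
}

total = sum(restore_steps_hours.values())
rto_hours = 1

print(f"Estimated end-to-end restore: {total} hours (RTO is {rto_hours} hour)")
if total > rto_hours:
    print("RTO cannot be met from cold: clustering or live failover is needed")
```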
Without a thorough restore test, your DR plan will be nothing more than a dream. Ideally, you want to be doing test restores rather than live restores; if you are doing a live restore while a system is down, it is fair to say someone has messed up somewhere.

Every PC and server you install nowadays has a whole lot of diagnostics and error checking going on. More than 80% of the DR situations we get involved with are caused by faulty hard disks which have been reporting impending errors for days, or even years. A disk in a RAID array fails and the RAID fault tolerance manages things nicely, until another disk fails and the array dies. We most commonly see RAID 5, 6 and 10 in use. These are expensive systems which can easily be kept healthy by swapping disks at the first failure.
Motherboard and memory failures are different: with memory, you will normally see some CRC errors before it fails outright. For the other components, consider dual power supplies, keeping spare parts, or even a spare server. Your RTO and RPO calculations will dictate your budget, i.e. whether you keep a spare server, cluster your servers, or run live failover replication between your premises and data centres.
Now the hardware DR plan is written, let us automate things.
Backups can easily be automated, and the results should be monitored by a competent member of staff. That monitoring should be logged somewhere, so management knows the logs are being checked fully and not just given lip service.
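A minimal sketch of that audit trail, assuming a placeholder backup.sh script and log path rather than any particular backup product:

```python
# Sketch: run a backup and record who verified the result.
# "/usr/local/bin/backup.sh" and the log path are placeholders.

import subprocess
from datetime import datetime, timezone

def run_and_log_backup(operator: str, logfile: str = "/var/log/backup-checks.log"):
    result = subprocess.run(["/usr/local/bin/backup.sh"], capture_output=True)
    status = "OK" if result.returncode == 0 else f"FAILED ({result.returncode})"
    stamp = datetime.now(timezone.utc).isoformat()
    # Naming the person who checked the run shows management the
    # logs are being verified, not just given lip service.
    with open(logfile, "a") as log:
        log.write(f"{stamp} backup {status}, verified by {operator}\n")

run_and_log_backup("jane.doe")
```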
Periodic restores are very important and are probably the only way to prove you can restore your systems. It is not uncommon for a Sysadmin to move data between a server’s drives, or to another server, when extra storage is required. This should be communicated to the team responsible for data backups so the backup sets can be amended.
Some other issues we have seen when a restore is needed are:
The encryption password is wrong or has been lost (without it, data cannot be restored).
The system architecture is unknown when rebuilding a new server (what disk sizes and partitioning did we use on the server before it crashed?).
The media where the data is stored isn’t available (this used to mean physical tapes, but nowadays it is more likely to mean insufficient download bandwidth for a cloud restore).
The only way to deal with these is through regular test restores and benchmarking. Virtualisation and low-cost hardware make it easy to recover an entire server into a restore lab so you can check your RTO is still achievable.
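A simple benchmark harness might look like this; restore-test.sh is a placeholder for whatever rebuilds your lab server:

```python
# Sketch: time a test restore and compare it against the agreed RTO.
import subprocess
import time

RTO_SECONDS = 60 * 60  # the agreed one-hour RTO

start = time.monotonic()
subprocess.run(["/usr/local/bin/restore-test.sh"], check=True)
elapsed = time.monotonic() - start

print(f"Test restore took {elapsed / 60:.1f} minutes")
if elapsed > RTO_SECONDS:
    print("WARNING: restore exceeds the RTO, revisit the DR plan")
```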
Calculate your RTO and RPO values and have these approved by senior management.
Remember, these are business decisions and not solely IT decisions.
Decide if the RTO and RPO can be achieved with the existing systems and backups. If yes, then test it in your lab.
Identify other risks, such as power, substations, networking, etc. Your systems might be 100% bullet-proof, however, you might be let down by single points of failure from your power or connectivity providers.
RTO and RPO are metrics to help you shape your DR strategy. Of course, you hope you never need to put them to the test.
With the correct planning, support from management, and a good IT budget, your IT system should be able to self-heal and preferably not fail in the first place.
Data corruption and ransomware can still occur, and when they do, the corruption is simply spread amongst all the storage locations. Every Sysadmin knows this, yet it is more commonplace nowadays with increasing cybercrime and ransomware.