Revised August 2020

RTO and RPO overview

This article explains the fundamentals of Business Continuity Planning (BCP).

Have you ever wondered what you would do if you had to recover a failed system and all its data? It is not always easy to put the theory into practice if you don’t have an identical lab to test your Disaster Recovery (DR) strategy.

One way to help you make an informed decision when designing your DR plan is to use RTO and RPO.

If you want to know if you can recover a system after a failure which has seen loss of data, you will need to make some basic calculations.

The industry standard for this uses the metrics RTO and RPO. By providing some information on what you already know, and what you want to obtain, you will have a clear idea of what to expect following a disaster.

RTO and RPO mean recovery time objective and recovery point objective respectively.

You only need to consider RTO and RPO for the systems you want to keep alive!

RTO and RPO explained in detail

Two of the most important metrics every organisation and IT Director must understand are RTO (recovery time objective) and RPO (recovery point objective). Together, these allow organisations to design and implement a robust disaster recovery strategy with a logical backup process which will allow them to restore every failed IT system within their targeted downtime parameters.

Neither value is much use on its own; RTO and RPO must be used together. If you are responsible for planning and implementing a disaster recovery plan, these two metrics will help you get there.

A DR plan using RTO and RPO can and should be as simple as possible. A basic example would be a single laptop which you run your business from; at the other extreme, you might be in charge of 1,000 servers in a data centre. Both support business processes whose loss severely impacts business continuity, and the same metrics apply to all decision makers, whether they are techies or non-techies.

How will RTO and RPO help my organisation?

These two metrics when used together will specify how long a system can be offline for, and how much data can be lost from it before your business processes suffer significant harm.

When you agree on your metrics, you can then benchmark your systems and processes to see if they comply with what you need.

Both will help you calculate how often your backups should run, and what your acceptable restore time will be. From there you will know how your data recovery process should work.
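As a rough illustration of how the two metrics drive a backup schedule, here is a minimal Python sketch. The function names and the figures are hypothetical, and it assumes the worst-case data loss is simply the gap between two successful backups.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assumption: a failure just before the next backup loses everything
    written since the last one, so the worst case equals the interval."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(measured_restore: timedelta, rto: timedelta) -> bool:
    """True if a measured (tested) restore time is within the RTO."""
    return measured_restore <= rto


# Hypothetical figures: hourly backups against a 4 hour RPO,
# and a 90 minute test restore against a 1 hour RTO.
print(meets_rpo(timedelta(hours=1), timedelta(hours=4)))     # True
print(meets_rto(timedelta(minutes=90), timedelta(hours=1)))  # False
```

If either check fails, you either change the system (more frequent backups, faster restores) or go back to the business and renegotiate the metric.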

RTO (Recovery Time Objective)

RTO is the maximum time allowed to restore a previously working system, and it must fall within a previously defined Service Level Agreement (SLA). Returning a previously working system to normality after an outage is often referred to as Business Continuity.

Example:

Scenario
A call centre (let’s call it Buddy Care) with 1,000 operators uses a customer management system to handle all customer enquiries. The system runs from a cluster of Linux servers and the company can’t deal with any customer queries or orders whilst it is offline.

Permitted System Downtime
Senior management have estimated they lose £10,000 in sales and £5,000 in customer churn in the first hour that they can’t service their customers’ requirements.
Customer churn cost doubles every hour as those customers become more frustrated. These values will be higher if the system is offline during a promotional period or peak time.
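To see how quickly these figures grow, here is a small Python sketch of the cumulative cost of an outage. It is only an illustration: it assumes the £10,000 sales loss repeats every hour the system is offline (the scenario only states the first hour), that churn starts at £5,000 and doubles each hour, and it ignores promotional and peak periods. The function names are hypothetical.

```python
SALES_LOSS_PER_HOUR = 10_000  # GBP, assumed to recur for every hour offline
FIRST_HOUR_CHURN = 5_000      # GBP, churn cost in the first hour


def churn_cost_for_hour(hour: int) -> int:
    """Churn cost incurred during a given hour offline (1-based), doubling hourly."""
    return FIRST_HOUR_CHURN * 2 ** (hour - 1)


def cumulative_outage_cost(hours_offline: int) -> int:
    """Total estimated cost in GBP after a whole number of hours offline."""
    sales = SALES_LOSS_PER_HOUR * hours_offline
    churn = sum(churn_cost_for_hour(h) for h in range(1, hours_offline + 1))
    return sales + churn


for hours in (1, 2, 4):
    print(hours, cumulative_outage_cost(hours))
# 1 15000
# 2 35000
# 4 115000
```

Under these assumptions the churn cost overtakes the sales loss by the third hour, which is why the business pushes for a short recovery target.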

RTO calculation
From these figures, the company decides their SLA to recover from any system outage is 1 hour.
A 1 hour SLA is probably unachievable with the system as it stands, so it will need improving by adding clustering, replication or live failover.

We have covered these services at the bottom of this article.

RTO is one hour

RPO (Recovery Point Objective)

Scenario
At Buddy Care they decide they can lose a week of customer service calls, but can’t lose any order details because these contain legal agreements and customer commitments.

The RPO statement says “we need all data recovered”; however, the business continuity SLA (RTO) of 1 hour supersedes the RPO at all times.

Permitted Data Loss
In an environment like Buddy Care’s, there will always be some data lost after a restore because transactions happen every few seconds but the backup runs only once every hour. It is commonplace for the live backup sets to contain the most recent data and for historical data to be stored in another, infrequently accessed data set. After the live data has been restored, historical data (6 months or older) is expected to take longer to restore, but it will be restored. This is the difference between the availability of a service (is it available now?) and the durability of data (can we restore all the data?).

RPO is a very significant figure, and it should be impressed on senior management that it is a way of mitigating and managing data loss after the event. They should be aware of what type of data won’t be restored initially and whether it will be restored at a later date.

Let’s not get too technical here, though. Even though we are talking tech and maths, the conundrum is the same and is an easy one for senior management to understand.

RPO calculation (which data can be lost)
One week of customer service calls.
No order details can be lost.

RPO is: One week of customer service calls. No order details can be lost.
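An RPO like this is really one value per type of data. The Python sketch below shows one way to write that down and check it. The RPO values come from the Buddy Care example; the category names, the "protection gap" figures and the function name are hypothetical, and the sketch assumes both data types are only protected by the hourly backup.

```python
from datetime import timedelta

# RPO per data category, taken from the Buddy Care example.
RPO_BY_CATEGORY = {
    "customer_service_calls": timedelta(weeks=1),
    "order_details": timedelta(0),  # no loss permitted
}

# Hypothetical protection gaps: the longest time that new data in each
# category exists only on the live system (here, the hourly backup).
PROTECTION_GAP = {
    "customer_service_calls": timedelta(hours=1),
    "order_details": timedelta(hours=1),
}


def rpo_breaches(rpo: dict, gap: dict) -> list:
    """Return the categories whose protection gap is larger than their RPO."""
    return sorted(cat for cat, limit in rpo.items() if gap[cat] > limit)


print(rpo_breaches(RPO_BY_CATEGORY, PROTECTION_GAP))  # ['order_details']
```

The result shows why a scheduled backup alone cannot deliver a "no loss" RPO: order details would need something continuous, such as replication, to close the gap.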

Real World examples

Buddy Care are a call centre with 1,000 operators and they use a customer management system to service their customer base of more than 2 million accounts.

Their system runs from a cluster of Linux servers and the company can’t deal with any customer queries or orders whilst it is offline. 

Nosebag City are a restaurant chain. If their booking system goes down at 02:00, the RPO would be zero or 100%, depending on how you state your metrics. The restaurant closes at 23:00 and the booking system is taken offline at 01:00 each day for backup and maintenance. When the system goes down at 02:00, no one would be placing orders anyway, and they would receive the standard ‘come back later’ message.

Analysis delegates

Because this is a critical business decision, the RTO and RPO metrics should be decided by senior management.

The IT Team will be required to confirm what is possible.

To recap

Your DR plan must be thorough and regularly tested.

If a backup takes 2 hours, don’t assume the restore will take 2 hours. During a restore, you might need to procure and replace hardware or rebuild a system from scratch before a restore can start. These processes can take days if you haven’t envisaged what is required or don’t have access to the hardware. What if your server goes offline at 06:00 on Sunday? Can you buy another one before Monday? What about restoring the operating system before you can restore your databases and applications?
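A simple way to avoid this trap is to add up every step of a recovery, not just the data restore. The Python sketch below does that; the step names and durations are made-up examples, so replace them with figures from your own test restores.

```python
from datetime import timedelta

RTO = timedelta(hours=1)

# Hypothetical steps and durations for rebuilding a failed server.
RECOVERY_STEPS = [
    ("procure and install replacement hardware", timedelta(hours=24)),
    ("reinstall the operating system", timedelta(hours=2)),
    ("restore databases and applications", timedelta(hours=2)),
    ("verify the restored system", timedelta(minutes=30)),
]


def total_recovery_time(steps) -> timedelta:
    """Sum the duration of every step, assuming they run one after another."""
    return sum((duration for _, duration in steps), timedelta())


total = total_recovery_time(RECOVERY_STEPS)
print(total)         # 1 day, 4:30:00
print(total <= RTO)  # False
```

In this example the 2 hour data restore is a small part of the total; the hardware is what breaks the RTO, which is the argument for spare parts or a spare server.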

Without a thorough restore test, your DR plan will be nothing more than a dream. You really want to be doing test restores rather than live restores. If you are doing a live restore when a system is down, it is fair to say someone has messed up somewhere. Every PC and server you install nowadays will have a whole raft of diagnostics and error checking going on. More than 80% of the DR work we get involved with is caused by faulty hard disks which have been reporting impending errors for days or even years. A disk in a RAID array fails and the RAID fault tolerance manages things nicely until another disk fails and the array dies. We most commonly see RAID 5, 6 and 10 in use. These are expensive systems and can easily be kept healthy by swapping disks at the first failure.

Motherboard and memory failures are different. With memory, we will normally see CRC errors starting before it fails. Regarding the others, try dual power supplies, keeping spare parts or even a spare server. Your RTO and RPO calculations will dictate your budget, i.e. do you keep a spare server, cluster your servers, or have live failover replication between your premises and data centres?

What else should be considered?

Now we have the hardware DR plan written, let’s automate things.

Backups can easily be automated and the results should be monitored by a competent member of staff. The action of staff monitoring backups should be logged somewhere so that management know the logs are being checked fully and not just paid lip service.
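Logging the check itself can be as simple as appending a line to a file every time someone reviews a backup result. The Python sketch below is one possible approach, not a feature of any particular backup product; the function name, file name and column layout are all hypothetical.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def record_backup_review(log_path, reviewer, job, status, notes=""):
    """Append one row recording who checked which backup job and what they found."""
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="") as fh:
        writer = csv.writer(fh)
        if is_new:
            # Write the header once, when the log file is first created.
            writer.writerow(["reviewed_at_utc", "reviewer", "job", "status", "notes"])
        writer.writerow(
            [datetime.now(timezone.utc).isoformat(), reviewer, job, status, notes]
        )
```

For example, `record_backup_review("backup_reviews.csv", "jsmith", "crm-nightly", "OK")` adds one timestamped row, giving management a record that the results were actually looked at.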

Test restores are very important and I can’t overstate that. It is not uncommon for an IT Admin to move data within a server or to another server when extra storage is required, and this needs to be communicated to the team who are responsible for data backups so the backup sets can be altered.

Some other issues we have seen when a restore is needed are: the encryption password is wrong or has been lost (without it, data cannot be restored); the system architecture is unknown when rebuilding a new server (what disk sizes and partitioning did we use on the server before it crashed?); the media where our data is stored isn’t available (this used to be physical tapes, but nowadays is more likely to be insufficient download bandwidth for a cloud restore). The list goes on and on and on.

The only way to deal with this is through regular restores and benchmarking. With virtualisation and low cost hardware, it is fairly easy to recover an entire server to a restore lab so you can check if your RTO is maintained.
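Benchmarking a test restore can be as basic as timing it and comparing the result with your RTO. The Python sketch below shows the idea. The `timed_restore` helper is hypothetical, and `fake_restore` is only a stand-in for whatever actually runs your restore into the lab.

```python
import time
from datetime import timedelta


def timed_restore(restore, rto: timedelta):
    """Run a restore callable, returning (elapsed time, whether it met the RTO)."""
    start = time.monotonic()
    restore()
    elapsed = timedelta(seconds=time.monotonic() - start)
    return elapsed, elapsed <= rto


def fake_restore():
    """Stand-in for a real restore job; it just waits briefly."""
    time.sleep(0.1)


elapsed, within_rto = timed_restore(fake_restore, timedelta(hours=1))
print(f"Restore took {elapsed}; within RTO: {within_rto}")
```

Keeping the result of every test restore lets you spot when growing data volumes start to push the restore time towards the RTO.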

Deliver RTO and RPO in steps

1. Calculate your RTO and RPO values and have these approved by senior management. Remember, these are business decisions and not solely IT decisions. Let’s assume these are set in stone.

2. Decide if the RTO and RPO can be achieved with the existing systems and backups. If yes, then test it to prove or fail. If not, then identify the cracks and decide on system changes. If there is no budget for the changes, go back to step 1.

3. Back up to more than one data centre (DC).

4. Test restores.

5. Identify other risks: power, substations, networking.

Disaster prevention

Not really part of this article, but much more important.
Hyper V
Data corruption cannot be fixed

BOBcloud.net
The Old Sorting Office, Corsham, Wiltshire SN13 9AA
Tel: 0800 907 8238