As we wrap up the year and plan our long, lazy holidays on the beach or next to the pool (or, in the north, next to a fireplace), we need to make sure our systems survive the holidays. How ready is your system if a disruptive event occurs while you are out of reach? Two metrics that guide data backup strategy and recovery plans are the recovery time objective (RTO) and the recovery point objective (RPO).

Recovery Point Objective (RPO)
A recovery point objective is the maximum amount of data loss the business can tolerate, measured as a period of time before the disruptive event. It is used when defining a data backup strategy. There is an inverse relationship between RPO and cost: a short RPO means data must not age much, so backups need to be taken frequently, and continuous data streaming to a secondary site might be required. Depending on volume, a faster network with more bandwidth may be needed to replicate the data, which can increase cost.
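
To make the link between backup frequency and RPO concrete, here is a minimal Python sketch. The function names are illustrative, and it assumes the worst case is a failure just before the next backup completes, so the data at risk is roughly one backup interval plus the time the backup takes to run.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Assumed worst case: the fault hits just before the next backup completes,
    so everything since the start of the previous backup is lost."""
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# A nightly backup satisfies a 24-hour RPO but not a 1-hour RPO.
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=24)))  # True
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=1)))   # False
```

If the check fails, the options are to back up more often, to shorten the backup itself, or to agree on a longer RPO with the business.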

 

Recovery Time Objective (RTO)
A recovery time objective is the maximum acceptable time from when a disruptive event affects a system until the system is recovered and fully operational again. Like RPO, there is an inverse relationship between RTO and cost. The shorter the RTO, the more costly the recovery process becomes, as additional infrastructure (e.g. servers and storage) and automation (time, effort and skill) are required. In addition, standby personnel might have to be remunerated on an ongoing basis.

What teams often overlook is that the RTO timer starts as soon as the disruption occurs, not when the issue is detected. For example, if an RTO is set to 11 hours and a fault happens at 22:00, but the team only discovers it at 07:00 the next morning, 9 hours of the RTO have already elapsed. This leaves only 2 hours to fix the issue: the deadline is 09:00, not 18:00 (which would be 11 hours after detection). Without sufficient monitoring and alerting systems to detect faults immediately, meeting the RTO can become impossible, especially if time-consuming repairs such as replacing hardware are needed.
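
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the calendar dates are made up for illustration, and the function names are not from any particular tool.

```python
from datetime import datetime, timedelta


def rto_deadline(fault_at: datetime, rto: timedelta) -> datetime:
    """The RTO clock starts at the fault, not at detection."""
    return fault_at + rto


def time_left(fault_at: datetime, detected_at: datetime, rto: timedelta) -> timedelta:
    """Time remaining to recover once the fault has been detected (never negative)."""
    return max(rto_deadline(fault_at, rto) - detected_at, timedelta(0))


# Hypothetical dates; the times and the 11-hour RTO match the example above.
fault = datetime(2024, 12, 24, 22, 0)
detected = datetime(2024, 12, 25, 7, 0)
rto = timedelta(hours=11)

print(rto_deadline(fault, rto))         # 2024-12-25 09:00:00
print(time_left(fault, detected, rto))  # 2:00:00
```

A result of zero means the RTO was already missed before anyone knew about the fault, which is the case that monitoring and alerting are meant to prevent.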

It is important to note that the urgency of recovery depends on your business needs. In a system where stock gets enriched on a daily basis (e.g. by taking photos) or a stock take is performed weekly, a daily backup may suffice, whereas a busy e-commerce website would probably tolerate very little data loss.

To ensure your systems survive the vacation period, here are some next steps you can take to review your backup and recovery processes:

1. Review (or implement) monitoring and alerting tools

  • Implement monitoring tools that provide real-time system health checks.
  • Configure alerts to notify the right people via multiple channels (e.g. SMS, email) when a health check fails or an outage is detected.

2. Test your backup and recovery plan

  • Schedule and perform a dry run of your recovery process.
  • Ensure your recovery process can meet the RTOs and RPOs and identify any gaps.
  • If no RTOs and RPOs are defined, talk to the customer to pin them down. If that cannot happen this year, prioritise it in the new year. Many customers won't know what RPOs or RTOs are, so bring suggestions with you and guide the customer on how to define them.

3. Update documentation

  • Document escalation processes, contact lists, and system dependencies.
  • Ensure all team members have access to up-to-date recovery plans.

4. Build in redundancy

  • If hosting in a cloud such as Azure or AWS, make use of their disaster recovery (DR) solutions and set up a DR site in another geographic location, such as Europe.

5. Plan and communicate

  • Plan an on-call schedule to maintain system availability as per customer expectation, e.g. office hours or 24/7.
  • Communicate the plan to all stakeholders and ensure the team members on standby are sufficiently trained to handle emergencies.
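
As a rough illustration of step 2, the measurements from a recovery dry run can be compared against the agreed objectives to list any gaps. The sketch below is an assumption about how such a check might look; the DryRunResult fields and the find_gaps function are hypothetical and would need to match whatever your team actually measures.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DryRunResult:
    """Hypothetical measurements taken during a recovery dry run."""
    detection_delay: timedelta   # fault occurs -> someone is alerted
    restore_duration: timedelta  # restore starts -> system fully operational
    backup_age: timedelta        # age of the newest restorable backup at the fault


def find_gaps(result: DryRunResult, rto: timedelta, rpo: timedelta) -> list[str]:
    """Return a description of every objective the dry run failed to meet."""
    gaps = []
    # Detection delay counts against the RTO because the clock starts at the fault.
    total_recovery = result.detection_delay + result.restore_duration
    if total_recovery > rto:
        gaps.append(f"RTO missed by {total_recovery - rto}")
    if result.backup_age > rpo:
        gaps.append(f"RPO missed by {result.backup_age - rpo}")
    return gaps


result = DryRunResult(
    detection_delay=timedelta(hours=9),
    restore_duration=timedelta(hours=3),
    backup_age=timedelta(hours=2),
)
print(find_gaps(result, rto=timedelta(hours=11), rpo=timedelta(hours=4)))
# ['RTO missed by 1:00:00']
```

An empty list means the dry run met both objectives. In this example the restore itself fits within the RTO, and the gap comes entirely from slow detection.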

With these steps, you can enjoy your holiday knowing your systems are prepared for the unexpected.

Iaan Roux – Diving into the depths of knowledge to retrieve forgotten wisdom.