Author: Matt Ferrari
Co-Founder & Former CTO
While most people associate disaster recovery (DR) with earthquakes, tornadoes and floods, such catastrophic events are rare. The reality is that 71 percent of disasters have more mundane causes: pulling the wrong cable, patching the wrong core switch without a strong change management process, systematic or social engineering intrusions, and general system failures that lead to data loss.
When these events occur, the data center can be down for several days or even weeks. That is a sobering thought, especially when you consider that 24 percent of organizations that suffer an outage of 24 hours or more close within two years, and 68 percent of organizations down for a week or more close within one year.
No disaster recovery plan
With so much at risk, you would expect most healthcare organizations to have some form of DR plan in place. You would be wrong.
In fact, in a survey of healthcare IT (HIT) executives, 69 percent said they did not have a DR plan (such as a second site for encrypted data). The only plan they had was to remain dark until the issue that took the data center down was fixed. Of nearly as much concern is that 78 percent of those organizations that did have a plan said they hadn’t tested it within the past 12 months.
In an industry where so much is riding on data, we clearly have some work to do to spread adoption of disaster recovery best practices. Many HIT executives simply believe implementing DR is too difficult. Others cite stagnant budgets and a lack of manpower. But these issues can be overcome, especially when working with a cloud service provider that offers DR as a Service (DRaaS).
Defining disaster recovery
One obstacle to good DR practices is a loose understanding of what a DR plan is. Some think of it as a binder detailing who to call in an emergency. Others believe creating tape or disk backups or sending SQL system logs offsite is all the DR they need. However, while these types of backups meet certain Health Insurance Portability and Accountability Act (HIPAA) requirements for data retention, they will be of little use in getting a healthcare organization running again if the data center goes down.
Effective DR takes one of two forms. The first is a core disaster recovery plan that duplicates and stores the raw data in a secondary system, either on premises or offsite in a physical or cloud data center. Unlike backup data, which has to go through a lengthy and complex restore process to be useful, this second data set is ready to be consumed as soon as the servers at the secondary site are spun up. This form of DR is ideal for Tier 3 or Tier 4 applications such as Exchange or SharePoint.
The second form is Replication (also known as Business Continuity). With Replication, the systems to which the organization’s data is duplicated are already operating in standby mode. Should a disaster be declared, the system is failed over and data quickly becomes available to applications once again. This is the preferred method for Tier 1 applications such as electronic health records (EHRs) that have more aggressive uptime requirements.
RTO and RPO determine disaster recovery needs
The two key measures of a DR plan are recovery time objective (RTO) and recovery point objective (RPO). How much downtime and how much data loss the organization can tolerate generally determines the target set for each.
RTO is the time in which a business process must be restored after a disaster is declared in order to avoid unacceptable consequences. For an EHR, where patient care and safety are on the line, that may be 15 minutes or less. For the billing system, taking 24 hours before functionality is resumed through DR is often tolerable. The faster a system can be recovered, the more it costs, so it is important to be brutally realistic when prioritizing applications.
RPO is the maximum tolerable period in which data might be lost from an IT service due to a disaster. For example, if the RPO is set at 24 hours, the healthcare organization could lose a day’s worth of data entry in the event of a disaster. That is typical of standard daily backups. The more critical the data, the shorter the RPO needs to be set.
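As a rough illustration of how the two measures work together, the sketch below checks a recovery setup against per-application targets. Only the 15-minute EHR RTO and the 24-hour billing RTO come from the discussion above. The RPO values, the 8-hour restore time, and the names Objectives and meets_objectives are assumptions made up for this example, not figures from any real system.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    """Recovery targets for one application (illustrative values only)."""
    rto: timedelta  # maximum tolerable time to restore the service
    rpo: timedelta  # maximum tolerable window of lost data


def meets_objectives(target: Objectives,
                     restore_time: timedelta,
                     copy_interval: timedelta) -> bool:
    """Return True if a recovery setup stays within both targets.

    Worst-case data loss is assumed to be one full interval between
    copies (backups or replication cycles).
    """
    return restore_time <= target.rto and copy_interval <= target.rpo


# Hypothetical targets: the RTO values follow the text, the RPO values are assumed.
ehr = Objectives(rto=timedelta(minutes=15), rpo=timedelta(minutes=15))
billing = Objectives(rto=timedelta(hours=24), rpo=timedelta(hours=24))

# Assumed setup: a nightly backup that takes 8 hours to restore.
nightly_backup = {
    "restore_time": timedelta(hours=8),
    "copy_interval": timedelta(hours=24),
}

ehr_ok = meets_objectives(ehr, **nightly_backup)
billing_ok = meets_objectives(billing, **nightly_backup)
print(f"EHR covered: {ehr_ok}, billing covered: {billing_ok}")
```

Under these assumptions the nightly backup is adequate for billing but falls well short for the EHR, which is the kind of gap that pushes Tier 1 applications toward replication.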
Disaster recovery in practice
Two real-world examples illustrate why having a solid DR plan is so critical.
The first involves the flooding in New York City in 2012 during Hurricane Sandy. An entire data center for the financial services industry went dark when the backup generators in the building’s basement were submerged in water. The data center provider had to physically pull its servers out of the racks and drive them to a data center it didn’t own in another city to restore functionality for its most important customers. Those customers went without their data for 4.5 days, while others were down for a week and a half before main power could be restored. Ask yourself – could your hospital or health system survive without access to data for more than four days?
In contrast, a hospital in Utah built its DR plan to include replication of data to the cloud and quarterly tests to ensure performance. When a huge snowstorm threatened to shut down the power to their local data center, the hospital proactively failed over production to the cloud environment with an RTO and RPO of less than an hour – and with no loss of performance or security. In fact, it was so successful that the hospital is now considering permanently moving all of its production to the cloud.
The advantage of DR as a Service
Once you are convinced of the necessity of a DR plan, it is time to look at how to do it. Several options are available.
You can co-locate your DR infrastructure within your data center. As a rule, this approach is unwise because whatever factors affect your production system are likely to affect the DR system as well.
Alternatively, you can build your own mirror site. This method gives you complete control over the technology but can get expensive, especially if you want to duplicate the full capabilities of your data center. It also means your organization is responsible for maintaining the hardware, software and infrastructure, including all patches and upgrades. In addition, the site must be within a reasonable driving distance so your personnel can go there to spin it up when a disaster is declared, yet best practices dictate that a DR site be located in an area that is climatologically different from the primary site. Those two requirements pull against each other. Finally, you have to consider the pipe to get the data from the data center to the DR site. It is difficult to seed a petabyte of data initially or send 100 TB each day through a 1 Gbps or even a 10 Gbps pipe.
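The arithmetic behind that last point can be sketched in a few lines. The example assumes the 1 and 10 figures are link speeds in gigabits per second, uses decimal units (1 TB = 10^12 bytes), and assumes the link runs at full speed with no protocol overhead, so real transfers would take longer. The function name transfer_days is hypothetical.

```python
def transfer_days(data_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move data over a link at 100% sustained utilization."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


PB = 10**15   # bytes in a petabyte (decimal)
TB = 10**12   # bytes in a terabyte (decimal)
GBPS = 10**9  # bits per second in one gigabit per second

seed_1g = transfer_days(1 * PB, 1 * GBPS)
seed_10g = transfer_days(1 * PB, 10 * GBPS)
daily_1g = transfer_days(100 * TB, 1 * GBPS)
daily_10g = transfer_days(100 * TB, 10 * GBPS)

print(f"Seed 1 PB over 1 Gbps:    {seed_1g:.1f} days")
print(f"Seed 1 PB over 10 Gbps:   {seed_10g:.1f} days")
print(f"Send 100 TB over 1 Gbps:  {daily_1g:.1f} days")
print(f"Send 100 TB over 10 Gbps: {daily_10g:.1f} days")
```

Even in this best case, seeding a petabyte takes about three months at 1 Gbps, and a 100 TB daily change set needs roughly 22 hours at 10 Gbps, leaving almost no headroom within a 24-hour window.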
An increasingly popular and recommended third option is to purchase DR as a Service (DRaaS). The costs for this approach are much lower than those of a mirror site. You only pay for capacity as needed, and, significantly, DR becomes an operating expense instead of a capital expense. Furthermore, the cloud services provider manages all maintenance, upgrades, HIPAA compliance and security, freeing up your internal staff to stay focused on high-value activities. The cloud provider’s experts are available to give advice on customizing the plan to fit your RTO and RPO requirements. Finally, these experts are already at the DR site, so your staff doesn’t have to travel to spin the system up, which is a key factor in a natural disaster.
One other important consideration when deciding which option to select is whether you can test the DR plan on a regular basis. You want to be sure that if a disaster strikes, your DR plan works as envisioned to preserve data and restore application service levels. Working with a DRaaS cloud services provider shifts much of the testing process onto the partner while ensuring tests occur on schedule.
Not if, but when
Having a well-designed, thoroughly tested DR plan is not an option. It is essential for today’s data-driven healthcare organizations. Implementing DRaaS through a healthcare-focused cloud services provider helps make disaster recovery more accessible and affordable to organizations of all sizes.