Massive Outage

An anatomy:

We had a simultaneous UPS failure at 2:33PM on the 22nd (which I am going to call H hour). This caused the ERP production cluster and storage to go down hard. Cause of the UPS outage is at the moment unknown. They seem to have been out for a little under a second. The effect was – how can I put this – bad.

Firstly, I should mention that we have had fairly significant outside assistance in setting up our production hardware. Everything is in a redundant configuration. Battery backed up disk controllers, RAID up the wazoo. A nice setup that is – or should be – resilient to failures.

Unless of course it has been mis-configured.

A little history. About 8 months ago (6 months before I joined the company) there was a disk outage on the storage cluster. Not a problem, except that the hardware support vendor (who I’m not going to name as we haven’t decided on our litigation strategy yet) could not source appropriate replacements as the storage cluster is pretty old. Instead of replacing the single drive that had failed, they replaced one of the controllers and the set of disks on that controller. They then re-setup the RAID 0+1 configuration across the old and new controller.

This took them about 15 hours to do and caused a significant outage. As you may imagine, this cost them (and us) a large amount of money. Us in lost sales and them in contract penalties. I would imagine that the on-site engineers were under a fair amount of pressure. And probably because of this they missed something.

The new controller’s battery was not installed. And the monitoring system wasn’t updated to register a battery failure on the new controller type…

So when the power outage occurred, one half of the RAID 0 had the cache written to disk and the other didn’t. And this continued across the RAID 1. And the result is that about half our filesystems wouldn’t even mount. Enormous corruption in the database tables that were open.

It took us about five hours to work this out – mostly because the build documentation was not up to date. We’d spent a lot of time trying to understand the source of the corruption and to fix it with volume management tools and (eventually) fsck and related tools. It was about three hours in that we decided that we should simultaneously start work on a recovery plan that did not include data from the un-mountable drives. Unfortunately, the limited resources available meant that we had to exhaust the easy options first.

Our recovery scenario therefore was fourfold:

  1. Install a battery into the controller and update the monitoring system
  2. Continue to try to recover the un-mountable volumes
  3. Recover the previous days backup onto the standby storage
  4. Work out if we could recover the transactions between the last backup and the failure

A little background. We have three major types of transactions:

  1. Customer management (add, change, delete)
  2. Order management (add, pick, ship)
  3. Inventory management (bulk movement, allocate stock to order)

We first set out to identify what information was critical to recovery. We decided to ignore customer changes and deletions, as the number would be small and not business critical.

Bulk movements have physical implications (pallets) so we had already asked Operations to start identifying these. And the carrier had not yet turned up to collect the 1,400 or so orders that had been picked on the day. By H+1 we had asked Operations to send the carrier away empty in case we needed to use the physical orders as an information store.

Then we tried to identify the data that had survived. Due to an idiosyncrasy of the storage design, it was more efficient to have multiple mount points and segment our data across these. This meant that by losing about half the file systems, we still had many of our critical files. Unfortunately not the primary orders file, but we did have a transaction audit file and an inventory file. We also had a payments log file for all credit card transactions. Our customer file was also gone.

So, by H+6 we had a plan. The development team started to write code to identify orders that were in the current transaction logs but not in the restored transaction logs. Also to identify stock movements and orphaned payments. Unfortunately the payment transaction did not have the customer attached to it, but we could make an educated guess by comparing payment amounts, times and credit card names. We could also link inventory movements (stock allocation) to the times of order entry. So by H+11 we had a suite of order recovery tools ready to test.

We trashed the orders in the order file in our QA system for June 20 and started to test the recovery process. Unfortunately this was not a good test as we could not (at least without recovering the trashed files from backup) ensure that the processes were consistent. However it gave the QA team some hints as to what to look for when the real data was available.

Unfortunately the data recovery was going slowly. Our tapes are stored off-site and the security company had problems sourcing keys (arg!). By H+9 our recovery started and we began the process of restoring to offline storage. The restore took 4 hours. Then there was a further 2 hours of replaying transaction logs to get the offline storage to a consistent state to start our testing.

Testing began at about H+15, and we had the usual bugs. While these were being identified, the QA team started to look at the resurrected orders to see if they looked reasonable. This was a critical test as if the approach that we had used was invalid then we would have no choice but to discard the days trading.

Sunrise was about 7AM (H+16). A misty morning.

H+17 Fortunately, the results from QA looked consistent. The development team had a new release, and while this was being tested, we trashed the current production storage, inserted the battery and rebuilt the RAIDs.

H+19 we had moved the data from offline storage to online storage and started recovery.

Our DR plan (which I was in the process of re-writing as we were also in the middle of a SOX audit), mandated a communications plan that had been put in place at the outset. We had company wide updates each hour and the president and cfo were woken at regular intervals (they probably didn’t need to be, but they hadn’t yet signed off on the new DR plan). By H+18, we had most of the company at work, so we also had PA announcements on progress. A lot of staff were diverted to customer service lines to handle the increased load.

H+20 the production system is rebooted and all partitions are visible. Disable user logins to the system to avoid any stupidity.

H+21 our data updates were complete and we had a lit of problematic payments and orders. That is, payments that we could not match with orders. We also had a set of orders for which there were no customers (new customers that must have been created on the day). We then set off a data consistency check prior to our final data manipulations.

H+23. Everything is set to go back into production. We incremented the order and customer table’s primary indexes to leave room for any orders or customers that we are able to manually recreate. We also gave the list of orphaned orders (338) and the list of customer referrals (78) (due to our business model, we generally know who a new customer was referred by as there is a commission system and the payment of the commission is audited and therefore we know who referred a new customer even if we don’t know who the new customer is). Customer service then began the process of ringing customers and finding out who they had referred.

We also printed the list of orders for which we could not identify payments and passed that to Customer service to verify (27).

We also printed a list of the orders that we had successfully rebuilt (1,283) and asked Operations to go to their picked list of orders and identify which had been processed already (to avoid double shipping). Operations found about 200 rebuilt orders that had already been picked. These picked rebuilt orders were also used to give us confidence that the item lists for an order were correct, as we also had Operations do spot checks on 30 of them at random.

We also gave operations the list of orphaned orders – as this would give us customer details that we could use.

Operations were then able to identify the orders that were left over – by definition, these were orders from previous days. We could bulk update these order to status ‘picked’ and ship them (we’d lost the changes to order status as they’re on the orders file).

We then briefed each of the warehouse supervisors (5 warehouses) on all the extra checking procedures that they would need to ensure that orders were not double shipped. 

H+25 The restored ERP back into production.

H+25.1 sleep.

Results after 1 week

  1. We have two orphaned orders. Customer service have made 823 calls.
  2. We have about 80 double shipped orders that we can identify. All customers either called us or were contacted. Most have been returned.
  3. No clue about the UPS. Have scheduled an outage for a UPS engineer to check both devices.

What we would do differently:

  1. Get more business involvement. Particularly from operations as they ended up having nothing to do for the first 20 hours and then incredibly rushed for the next 10. Also this would have freed up ICT resources as the local operations leadership could have briefed the remote warehouse management.
  2. Our outsourced support teams have been allowed to be slack. SLAs were not met by Technical support, Tape Security service or Vendor support. We need to ensure that these services are kept to account and regularly tested.
  3. Do more scenario testing. Play ‘what if’ games with the team. We started the process with a very small team. Had we begun with the entire team we may have been able to do more in parallel.
  4. We’re short in broad technical personnel
What worked well:
  1. We were measured in our approach and planned consistently. Every change was part of a procedure. no change was made until the procedure was documented and understood by all participants. All procedures had a rollback.
  2. Our communication to the company was consistent and proactive. There was a lot of timely information sharing and we did not over promise. The business was able to plan because we were able to give 2 hourly forecasts.
  3. We decided early on to work for a high quality outcome and not to create more problems for ourselves. We agreed that the shibboleth for the recovery was data integrity. Not speed.
  4. The data was restored or recreated consistently