I just experienced my first real Business Continuity Plan (BCP) event. In the past my company has simulated BCP events to test our response and capabilities. The simulated events were always minor, like a fiber cut. These are a couple observations and lessons I learned being involved in support as a Systems Administrator.
1. Your one stupid decision will be brought to light. Two major data center providers had major issues because of one poor decision. They both had located their generator fuel pumps in the basement. Generators in NYC/Manhattan are typically located on the roof due to the cost of real estate. Pumps are used to transport the fuel from the street level up 15-50+ stories. Basements are the first places to flood, thereby making your fuel pumps useless, eventually leading to your generators running out of fuel.
Articles to read:
Flooded NY data centers survive Sandy on generator power, fuel deliveries | Ars Technica
2. Manpower is important but tough to guarantee. You need the right people in the right places to keep things running or to fix things that break. But those same people have families and responsibilities. Key people may be busy dealing with more important matters. Business is important but life is paramount.
3. Make sure everyone has remote access and uses it periodically. After Sandy, employees needed to work from home because our offices still did not have power. That morning, 10% of the company opened tickets requesting help with remote access. This could have been avoided if we had encouraged employees to work from home periodically, thereby ensuring remote access.
4. Have more than two of each important piece of infrastructure, or have more than one backup. A BCP event causes havoc, and in that chaos it will turn your redundant services into single points of failure (SPOF). You might have had two VPN servers before you lost power to an office, but afterwards you have a significant SPOF.
5. Be diligent in your BCP testing and preparation. Test those generators, test them again, and test them on real load a third time. You might have designed a system to have redundancy, but you need to be sure they built what you designed. I know many fiber paths that were designed to be redundant, but collapse together at some points (usually the last X feet). Be strict and follow up.