
Patrick Debois

Disaster Recovery Planning by using Post-IT's

You might wonder how Disaster Recovery relates to Post-IT's. If you have ever discussed a Disaster Recovery Planning (DRP) project, you know that it often starts with a technical overview of all the systems involved. This helps you identify, for instance, single points of failure (SPOF) and other relationships that were not clear before. In complex systems, or in organizations with many systems and interactions, no single person has a global overview.
Creating this overview is crucial for a shared understanding, but a sketch or drawing on a flip chart is often limiting:
  • if something is drawn on paper, it is difficult to rework later
  • rearranging the components is often cumbersome, because the whole board needs to be shifted or erased
Therefore I proposed to a client that we make such a drawing using Post-IT's and stick them on the wall. This allowed us to use the entire wall of the meeting room, instead of the limited space a board or chart provides.

After some discussion we came up with two kinds of component, similar to UML deployment diagrams:

Physical component:
  • for instance Sun V440, Compaq ..., Cisco 15000, Nokia Firewall, D-Link Switch, Alteon Loadbalancer
Software component:
  • for instance Apache web server, Sun LDAP server, Oracle Database, Monitoring Agent, SSH Agent, Backup Agent
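
If you later want to keep a record of the wall, the two kinds of component could be captured in a small data structure. The following is only a minimal sketch: the names `PostIt`, `Kind`, `runs_on` and `hosted_on` are made up for illustration, and the pairing of software with hardware is an assumed example, not the client's actual setup.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Kind(Enum):
    PHYSICAL = "physical"
    SOFTWARE = "software"


@dataclass
class PostIt:
    """One note on the wall (hypothetical model)."""
    name: str
    kind: Kind
    # For a software component: the physical component it is deployed on.
    runs_on: Optional[str] = None
    # Free-form markings written on the note, e.g. {"site": "A"}.
    marks: dict = field(default_factory=dict)


# Assumed example wall; which software runs on which box is invented here.
wall = [
    PostIt("Sun V440", Kind.PHYSICAL),
    PostIt("Apache web server", Kind.SOFTWARE, runs_on="Sun V440"),
    PostIt("Oracle Database", Kind.SOFTWARE, runs_on="Sun V440"),
]


def hosted_on(wall, host):
    """Names of the software components stuck onto a given physical component."""
    return [p.name for p in wall if p.runs_on == host]
```

Calling `hosted_on(wall, "Sun V440")` then lists everything that would go down together with that machine.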

Please note that the following are often forgotten in the discussion but should be included:
  • External dependencies that the application uses but the team does not manage (mail server, DNS server, Internet connection, ...)
  • Identify all systems involved, including those needed for data replication, management (remote control), VPN for support, and backup
On each Post-IT you can then mark or check:
  • If the applications are spread over sites, use some tape to group the components and label the locations
  • Mark them as production or management
  • Mark them as internally or externally managed
  • If data is involved, mark its role (master, slave)
  • Note special security precautions
  • Capacity (for instance # of users for a web server, or # of CPUs for hardware)
  • Availability: check whether the failover/failback mechanism still works if you take out one Post-IT
  • Data storage (local, SAN, ...)
  • Check how it is monitored
  • ... and of course many more
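
The "take out one Post-IT" check can also be mimicked on a written-down copy of the wall. The sketch below is an illustration under assumptions, not the method we used with the client: the `links` layout is an invented example, and the function names are hypothetical. It removes one note at a time and tests whether a path from the entry point to the data still exists.

```python
from collections import deque

# Invented example wall: each key is a Post-IT, each value lists the
# Post-ITs it connects to.
links = {
    "Internet Connection": ["Firewall"],
    "Firewall": ["Loadbalancer"],
    "Loadbalancer": ["Web server A", "Web server B"],
    "Web server A": ["Database"],
    "Web server B": ["Database"],
    "Database": [],
}


def reachable(links, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed note."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in links.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, start, goal):
    """Notes (other than the endpoints) whose removal cuts every path."""
    return [
        note for note in links
        if note not in (start, goal)
        and not reachable(links, start, goal, removed=note)
    ]


spofs = single_points_of_failure(links, "Internet Connection", "Database")
```

In this invented layout the two web servers cover for each other, while the firewall and the load balancer each sit alone on the only path. The endpoints are left out of the result on purpose, since removing them trivially breaks the path.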

Our wall kept growing over the next few days as new things came up, and everybody learned. We brought in different groups so that they could complete the wall. Unfortunately I cannot post the results here (they are confidential), but it would be nice to hear how it works for you.