Automation and Cascading Changes

November 9th, 2006

I was with the CIO of a large public utility, where several of the IT systems control real-world infrastructure. He made a very interesting point: almost universally, what people in IT do to reduce cost is automate the manual tasks. While this seems like the correct way of doing things, one of its big dangers is that cascading changes cause cascading failures.

Before automation, there were several manual steps that inherently created barriers against cascading failures and, in some sense, partitioned the infrastructure. Several examples came up:

  • Active Directory/DNS: since the directory auto-replicates, a mistake propagates relatively quickly
  • Production and Disaster Recovery: auto-sync between the two can bring down both
  • Network: this is the classic case, because routing changes propagate quickly
  • Any clustering solution

We had an interesting discussion about what to do in this case. Clearly you want automation, and introducing a human in the loop is not an option in many cases. But how do you solve this problem?

One of our colleagues at a large manufacturing facility had solved this problem in an interesting way. He took one node in a cluster or in Active Directory, kept it disconnected, and manually connected it once in a while.

That was very interesting because he had figured out a way to technically enforce a change window, one that he opened and closed by connecting the node to the network and disconnecting it again.
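To make the idea concrete, here is a minimal sketch of a held-back node in a toy replicated store. It is only an illustration, not how Active Directory or any real clustering product replicates: the `Replica` and `Cluster` classes, the method names, and the `dns:app` key are all hypothetical. Writes reach every connected replica at once, while the held-back replica only catches up when someone opens a change window.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node in a toy replicated directory (hypothetical model)."""
    name: str
    connected: bool = True
    data: dict = field(default_factory=dict)


class Cluster:
    """Pushes every write to all connected replicas immediately."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.log = []  # ordered history of (key, value) changes

    def write(self, key, value):
        self.log.append((key, value))
        for replica in self.replicas:
            if replica.connected:
                replica.data[key] = value

    def open_change_window(self, replica):
        """Reconnect a held-back replica and replay the changes so far."""
        replica.connected = True
        for key, value in self.log:
            replica.data[key] = value

    def close_change_window(self, replica):
        """Disconnect the replica again so later changes cannot reach it."""
        replica.connected = False


live = Replica("live")
held_back = Replica("held-back", connected=False)
cluster = Cluster([live, held_back])

cluster.write("dns:app", "10.0.0.5")    # good change, live node gets it at once
cluster.open_change_window(held_back)   # operator is satisfied, lets it through
cluster.close_change_window(held_back)

cluster.write("dns:app", "0.0.0.0")     # bad change cascades to connected nodes

print(live.data["dns:app"])
print(held_back.data["dns:app"])
```

In this sketch the live node ends up with the bad value while the held-back node still holds the last value that was let through during a window, so there is a known-good copy to recover from.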

Entry Filed under: Change Control
