Wednesday, August 5, 2009

How to remove a compute host forcibly from Grid Engine cluster

Sometimes you want to remove a node from your Sun Grid Engine cluster, but the node is no longer accessible. This can happen because:
  • You are running an HPC cluster in Amazon EC2 with SGE Service Domain Manager with Cloud Adapter
  • The node simply crashed and you don't have a budget to replace it
  • Any other scenario where you want to remove an SGE execution host, but the host has gone down and is never going to come back up.
In these cases the usual way of removing an execution host from the SGE master is not going to work, because the exec daemon on the execution host can no longer be uninstalled. So you will keep getting those annoying entries in qstat -f about inaccessible execution hosts. Unfortunately, SGE doesn't provide a clean mechanism for getting rid of such hosts.

But that's not the end of the story. You can simply hammer SGE into shape by cleaning out the entries for such hosts on the master with the following method. Let's assume the host you want to remove is HOST_GO_AWAY.

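  1. qconf -dattr hostgroup hostlist HOST_GO_AWAY @cloud_hosts >/dev/null   # remove the host from the @cloud_hosts host group
  2. qconf -dattr hostgroup hostlist HOST_GO_AWAY @allhosts >/dev/null   # remove the host from the @allhosts host group
  3. qconf -dh HOST_GO_AWAY >/dev/null   # remove it from the administrative host list
  4. qconf -ds HOST_GO_AWAY >/dev/null   # remove it from the submit host list
  5. qconf -de HOST_GO_AWAY >/dev/null   # remove it from the execution host list
  6. rm -f /opt/sge/default/common/local_conf/HOST_GO_AWAY   # delete the host's local configuration file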
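
If you have to do this often, the steps above can be wrapped in a small shell function. This is only a sketch: the names remove_dead_host, run, SGE_HOST_GROUPS, SGE_CELL_DIR and DRY_RUN are my own inventions, and the host groups and the /opt/sge/default path are simply the ones from the list above, so adjust them for your cluster. By default it only prints the commands it would run; set DRY_RUN=0 to actually execute them on the master. The host is taken out of the host groups first, since as far as I know qconf -de refuses to delete an execution host that is still referenced elsewhere.

```shell
#!/bin/sh
# Sketch: clean a dead execution host out of the SGE master's configuration.
# All variable and function names here are hypothetical helpers, not part of SGE.

SGE_CELL_DIR="${SGE_CELL_DIR:-/opt/sge/default}"              # cell directory from the post
SGE_HOST_GROUPS="${SGE_HOST_GROUPS:-@cloud_hosts @allhosts}"  # host groups from the post
DRY_RUN="${DRY_RUN:-1}"                                       # 1 = print only, 0 = execute

# Print the command in dry-run mode, otherwise run it quietly and warn on failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@" >/dev/null 2>&1 || echo "warning: command failed: $*" >&2
    fi
}

remove_dead_host() {
    host="$1"
    if [ -z "$host" ]; then
        echo "usage: remove_dead_host HOSTNAME" >&2
        return 1
    fi
    # Remove the host from every host group that may still list it.
    for group in $SGE_HOST_GROUPS; do
        run qconf -dattr hostgroup hostlist "$host" "$group"
    done
    run qconf -dh "$host"   # administrative host list
    run qconf -ds "$host"   # submit host list
    run qconf -de "$host"   # execution host list
    # Finally delete the host's local configuration file.
    run rm -f "$SGE_CELL_DIR/common/local_conf/$host"
}
```

Source the file and call remove_dead_host HOST_GO_AWAY to review the six commands, then repeat with DRY_RUN=0 once they look right.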
That's all!

