30.9.14

OpenMPI jobs hang on Rocks 5.5 cluster

Problem:
There are many, many reasons why an OpenMPI job might fail or hang during execution on a Rocks cluster.  An oddball one involves a virtual network interface, "virbr0", with IP address 192.168.122.1.  This is the bridge for libvirt's "default" virtual network, and with the stock configuration it has the same address on every node.  mpirun may try to pass messages between machines over it rather than using your real IPs.  The good news is, you can remove it.

Solution:
1. Verify that "virbr0" exists:  /sbin/ifconfig

2. If so, make sure that you don't have other virtual networks that you actually need. There is a pretty good chance the next step will mess them up.

3. Run the following commands on each node:

virsh net-destroy default
virsh net-undefine default
/sbin/service libvirtd restart
/sbin/ifconfig

The last command is just to verify that "virbr0" no longer exists.  Because the "default" network was undefined as well as destroyed, "virbr0" should not recreate itself when you reboot the cluster.
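
If you have more than a handful of nodes, the steps above can be scripted.  What follows is only a rough sketch, not a tested Rocks tool.  The function names (has_virbr0, virbr0_cleanup_cmd, cleanup_nodes) are made up for this example, and it assumes you can ssh to each node as root without a password prompt.  The libvirt service name is passed in as an argument rather than guessed, since it differs between distributions.

```shell
# has_virbr0: reads ifconfig output on stdin and succeeds if a virbr0
# interface is listed. Matches both "virbr0   Link encap..." and "virbr0: flags=...".
has_virbr0() {
    grep -q '^virbr0[: ]'
}

# virbr0_cleanup_cmd: prints the commands from step 3 as a single line.
# $1 = libvirt service name (an input here, not detected automatically).
virbr0_cleanup_cmd() {
    printf 'virsh net-destroy default; virsh net-undefine default; /sbin/service %s restart\n' "$1"
}

# cleanup_nodes: runs the cleanup on each node over ssh, then checks the result.
# $1 = libvirt service name, remaining arguments = node hostnames.
# Assumes passwordless root ssh to every node.
cleanup_nodes() {
    service_name="$1"
    shift
    cmd=$(virbr0_cleanup_cmd "$service_name")
    for node in "$@"; do
        echo "== $node =="
        ssh "$node" "$cmd"
        # Same check as step 1, done remotely
        if ssh "$node" /sbin/ifconfig | has_virbr0; then
            echo "virbr0 is still present on $node"
        else
            echo "virbr0 is gone from $node"
        fi
    done
}
```

To check a single machine by hand, pipe /sbin/ifconfig into has_virbr0.  To clean up several nodes, call cleanup_nodes with the service name used in step 3 followed by the node names, for example compute-0-0 compute-0-1 (substitute your own host names).  Read step 2 first, since this removes the "default" network on every node you list.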

