There are many, many reasons why an Open MPI job might fail, or hang during execution, on Rocks clusters. An oddball one relates to a virtual network interface, "virbr0", with IP address 192.168.122.1. This is the bridge for libvirt's "default" NAT network, which is typically created when libvirt is installed, whether or not you run any virtual machines. Since every node ends up with the same "virbr0" address, mpirun may try to use it to pass messages between machines, rather than using your real IPs. The good news is, you can remove it.
Solution:
1. Verify that "virbr0" exists: /sbin/ifconfig
2. If so, make sure that you don't have other virtual networks that you actually need. There is a pretty good chance the next step will mess them up.
3. Run the following commands on each node:
virsh net-destroy default
virsh net-undefine default
/sbin/service libvirt-bin restart
/sbin/ifconfig
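The last command is just to verify that "virbr0" no longer exists. If the restart command reports that libvirt-bin is an unrecognized service, use libvirtd instead; that is the service name on the CentOS base that Rocks uses, while libvirt-bin is the Debian/Ubuntu name. Because the "default" network has been undefined rather than just stopped, "virbr0" should not recreate itself when you reboot the cluster.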
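If you have a lot of nodes, the check-then-remove part of the steps above can be wrapped in a small shell function. This is only a rough sketch, not a tested Rocks tool: the names iface_exists, remove_default_net and SYSFS_NET are made up for illustration, and it assumes a Linux node where interfaces show up under /sys/class/net. It deliberately leaves out the libvirt service restart, since the service name differs between distributions.

```shell
#!/bin/sh
# Sketch: remove libvirt's "default" network only when virbr0 is present.
# SYSFS_NET is a hypothetical override (handy for dry runs); by default
# the real Linux sysfs path is used.

iface_exists() {
    # Succeeds if the named interface has an entry under sysfs.
    [ -e "${SYSFS_NET:-/sys/class/net}/$1" ]
}

remove_default_net() {
    if iface_exists virbr0; then
        # Same two virsh commands as in step 3.
        virsh net-destroy default
        virsh net-undefine default
    else
        echo "virbr0 not present; nothing to do"
    fi
}
```

To use it, source the file on a node and run remove_default_net as root, then restart the libvirt service and check /sbin/ifconfig as in step 3. On a node without "virbr0" it prints a message and changes nothing.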
 