You may need to set the stack size to unlimited (in bash):

    ulimit -s unlimited

Assuming Open MPI, compile with:

    nvcc -c mpitranschol.cu -I /home/stg20/include/ -DOMPI_SKIP_MPICXX --host-compilation c
    mpicc mpitranschol.o -L/usr/local/cuda/lib -lcudart -DOMPI_SKIP_MPICXX

The matrix sizes must satisfy various conditions (see the .cu file), and the
#define's for the number of GPUs must be set appropriately.

Run with something like:

    time /home/stg20/bin/mpirun -np 2 a.out > tmp

You can also use a hostfile containing something like:

    localhost slots=4

and run with:

    time /home/stg20/bin/mpirun -hostfile ../hostfile -np 2 a.out > tmp
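
The hostfile format mentioned above (a host name followed by slots=N) can be generated and sanity-checked with a small shell helper. This is only a sketch: the function names write_hostfile and total_slots are made up for illustration and are not part of this code. It assumes one host per line with a slots=N field, and that you want -np to stay at or below the total slot count.

```shell
# Hypothetical helpers (names are illustrative, not part of the original code).

# Write a one-line hostfile of the form "<host> slots=<n>".
write_hostfile() {
    file=$1
    host=$2
    slots=$3
    printf '%s slots=%s\n' "$host" "$slots" > "$file"
}

# Print the sum of all slots=N fields found in a hostfile.
total_slots() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=/) { sub(/^slots=/, "", $i); n += $i }
    } END { print n + 0 }' "$1"
}
```

For example, after `write_hostfile ../hostfile localhost 4`, a check such as `[ 2 -le "$(total_slots ../hostfile)" ]` could be run before calling mpirun with -np 2.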