bwUniCluster
Getting access
Accessing bwUniCluster is a multi-step procedure that is described in detail here: https://wiki.bwhpc.de/e/Registration/bwUniCluster. Please follow these steps as described. When asked why you need access, please mention the “HPC with C++ class at the University of Freiburg”.
Logging in and transferring data
You can then log in to bwUniCluster with the ssh (secure shell) command:
ssh -Y <username>@uc2.scc.kit.edu
The option -Y tells ssh to forward X11 connections. This allows you to display plots while remotely logged in to bwUniCluster. This will not work if you are logging in from a Windows machine, unless you have set up a dedicated X11 server on your Windows computer.
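A quick way to check that X11 forwarding works is to look at the DISPLAY variable after logging in (this is just a sanity check; it assumes an X server is running on your local machine):
echo $DISPLAY   # should print something like localhost:10.0 if forwarding is active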
You can copy files to bwUniCluster with the scp (secure copy) command:
scp my_interesting_file <username>@uc2.scc.kit.edu:
Don’t forget the colon : after the machine name when running the scp command. Note that you can use the same command to get the data back from bwUniCluster. For example, to copy the file result.npy from your home directory on bwUniCluster to the current directory, simply execute (on your local machine):
scp <username>@uc2.scc.kit.edu:result.npy .
The dot . refers to your current directory. You can also specify full paths on either the remote or the local machine when executing scp.
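As a sketch (all paths below are placeholders, not actual directories on the cluster), copying with full paths and copying a whole directory recursively look like this:
# copy a local file into a specific directory on bwUniCluster
scp /path/to/local/input.npy <username>@uc2.scc.kit.edu:/path/on/cluster/
# copy a whole directory from bwUniCluster back to the local machine
scp -r <username>@uc2.scc.kit.edu:/path/on/cluster/results ./results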
Setting up the software environment
After logging in to bwUniCluster at uc2.scc.kit.edu, you will need to set up your local environment to be able to compile and run parallel applications. The module command loads a specific environment:
- You can see all possible modules with module avail. Note that the list changes when you load a module as there are dependencies.
- You can search for a specific module with module spider.
- You can list the presently loaded modules with module list.
- You can remove (unload) all presently loaded modules with module purge.
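For example, to find out which Python modules exist and how to load a specific one (the module name below is the one used later in this section), you could run:
module spider python                          # list all Python modules
module spider devel/python/3.12.3_gnu_13.3    # show details and required dependencies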
Please load the following modules (just execute these commands at the command line):
module load compiler/gnu mpi/openmpi devel/python/3.12.3_gnu_13.3
You can check whether the correct modules were loaded by executing
module list
Now you need to install muFFT. First compile muFFT’s dependencies. This is described in the Getting Started page of muFFT. In short, execute
curl -sSL https://raw.githubusercontent.com/muSpectre/muFFT/main/install_dependencies.sh | sh
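If you prefer not to pipe a downloaded script directly into sh, a cautious variant (same URL as above) is to download it first, inspect it, and then run it:
curl -sSL https://raw.githubusercontent.com/muSpectre/muFFT/main/install_dependencies.sh -o install_dependencies.sh
less install_dependencies.sh    # review what will be installed
sh install_dependencies.sh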
Now set up a virtual Python environment to install muFFT. You need to execute this in your home directory, which is where the venv directory will be located:
python3 -m venv venv
You activate the virtual environment with:
source venv/bin/activate
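To confirm that the virtual environment is active, check which Python interpreter is picked up (a minimal sanity check):
which python3    # should point to ${HOME}/venv/bin/python3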
Upgrade pip in your venv because older versions of pip do not play well with meson, which is used for building muFFT:
python3 -m pip install --upgrade pip
You also need to manually install mpi4py:
python3 -m pip install --force-reinstall --no-cache --no-binary mpi4py mpi4py
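A quick way to check that mpi4py was built against the OpenMPI library you loaded (this only prints the MPI library version and does not start a parallel run) is:
python3 -c "from mpi4py import MPI; print(MPI.Get_library_version())"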
Now install muFFT:
PKG_CONFIG_PATH=$HOME/.local/lib/pkgconfig:$PKG_CONFIG_PATH \
LIBRARY_PATH=$HOME/.local/lib:$LIBRARY_PATH \
CPATH=$HOME/.local/include:$CPATH \
pip install -v \
--force-reinstall --no-cache \
--no-binary muGrid --no-binary muFFT \
muGrid muFFT
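If the installation succeeded, importing both packages should work without errors (a minimal check; it does not exercise any FFT functionality):
python3 -c "import muGrid, muFFT; print('muGrid and muFFT imported successfully')"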
Running simulations
bwUniCluster has extensive documentation that can be found here: https://www.bwhpc-c5.de/wiki/index.php/Category:BwUniCluster. Simulations are typically run as batch jobs. They have to be submitted through a batch or queueing system that takes care of assigning the actual hardware (compute node) to your job. A description of the queueing system can be found here. Please make sure you understand the concept of a partition described here.
Run simulations within a dedicated workspace. bwUniCluster provides a parallel file system for such workspaces. Create a workspace with
ws_allocate spectral_methods 30
where the number (30) is the lifetime of the workspace. This workspace will be deleted after 30 days. You can list all your workspaces with ws_list and extend their lifetime with ws_extend.
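To change into the workspace, you can query its path with ws_find (part of the same workspace tools as ws_allocate; the workspace name is the one created above):
cd "$(ws_find spectral_methods)"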
IMPORTANT: never ever run a simulation on the front/login node (i.e. without submitting a job using the sbatch command, see below). Also do not run simulations in your home directory. Use a workspace. Your access to bwUniCluster could be revoked if you do this.
Job scripts
To run your job, you need to write a job script. The job script is executed by the bash command, which is the shell that you are using on Linux systems. (See the first tutorial here for an introduction to the shell.) The job script specifies how many CPUs you require for your job, how to set up the software environment and how to execute your job. An example job script (you can use this one almost as is) looks like this:
#!/bin/bash -x
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --time=00:40:00
#SBATCH -J SpectralMethods
#SBATCH --mem=6gb
#SBATCH --export=ALL
#SBATCH --partition=multiple
module load compiler/gnu mpi/openmpi devel/python/3.12.3_gnu_13.3
source ${HOME}/venv/bin/activate
echo "Running on ${SLURM_JOB_NUM_NODES} nodes with ${SLURM_JOB_CPUS_PER_NODE} cores each."
echo "Each node has ${SLURM_MEM_PER_NODE} of memory allocated to this job."
time mpirun python3 my_python_program.py
Lines starting with a # are ignored by the shell. Lines starting with #SBATCH are ignored by the shell but are treated like command-line options to the sbatch command that you use to submit your job. You can see all options by running man sbatch.
The --nodes option specifies how many nodes you want to use. --ntasks-per-node specifies how many processors per node you need. The --time option specifies how long your job is allowed to run. Your job is terminated if it exceeds this time! The -J option is simply the name of your job. It determines the names of output files generated by the batch system. Check the documentation linked above to understand the other options.
The time command in front of mpirun measures the execution time of the code and prints it to screen. Note that you do not tell mpirun how many cores to use. The batch system passes this information on automatically.
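Since simulations should run inside a workspace rather than the home directory (see above), here is a sketch of how the last lines of the job script could be adapted, assuming the workspace created earlier, the ws_find helper, and that your Python script lives in your home directory:
# change into the workspace before starting the run
cd "$(ws_find spectral_methods)"
# run the simulation script from the home directory, writing output into the workspace
time mpirun python3 ${HOME}/my_python_program.py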
Submitting jobs
Assuming the filename of the above script is run.job, you can submit this script with
sbatch run.job
Everything that is included via #SBATCH in the script above can also be specified on the command line. For example, to change the number of cores to 320 you can issue the command:
sbatch --nodes=4 --ntasks-per-node=80 run.job
Your job may need to wait for resources and will not run immediately. You can check the status of your job with
squeue
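On a busy cluster the full queue is long, so it helps to restrict the output to your own jobs (standard Slurm options):
squeue -u $USER      # show only your own jobs
squeue -j <JOBID>    # show a specific job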
Once it has run, you will find a file that starts with job_ and has the extension .out. This file contains the output of the simulation, i.e. the output that is normally written to the screen when you run from the command line. Note that bash -x (the first line in the job submission script above) instructs bash to print every command that is executed onto the screen. This makes debugging (of the job script, not your simulation code) easier.
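To follow the output while the job is still running, you can watch the output file grow (the exact file name depends on your job, so the pattern below is only an example):
tail -f job_*.out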
squeue shows the JOBID of your job. You can cancel a job by executing
scancel <JOBID>
More information on a certain job can be obtained with
scontrol show job <JOBID>
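For jobs that have already finished and are no longer listed by squeue, accounting information can be retrieved with the standard Slurm accounting command (provided accounting is enabled on the cluster):
sacct -j <JOBID> --format=JobID,JobName,Elapsed,State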