SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager designed for high-performance computing (HPC) clusters. It is responsible for allocating resources to users for their jobs, managing job queues, and scheduling tasks across the cluster. SLURM allows users to submit, monitor, and control jobs efficiently. Key features include job prioritization, resource allocation, job dependencies, and the ability to run parallel tasks across multiple nodes. By submitting a SLURM job, you essentially request a session within a particular container with the resources you asked for. There are two modes of submitting jobs to SLURM:
Interactive jobs allow users to request resources and immediately access a command-line session on the allocated compute node(s). This is useful for debugging, testing, or running applications that require user interaction. To start an interactive job, use the srun command in the terminal with the appropriate options:
srun --ntasks=<"number of tasks"> \
--cpus-per-task=<"CPUs per task"> \
--gpus-per-task=<"GPUs per task"> \
--mem=<"memory needed"> \
--time=<"time limit (e.g., HH:MM:SS)"> \
--partition=<"partition name"> \
--qos=<"qos name"> \
--nodelist=<"node name or names"> \
--job-name=<"job name"> \
--output=<"output file path"> \
--error=<"error file path"> \
--container-image=<"path to container image"> \
--container-mounts=<"host path1:container path1,host path2:container path2,..."> \
--container-writable \
--container-remap-root \
--container-save=<"path to save container state"> \
--pty bash
This brings you to the compute node with an open terminal. To attach to this open session via VS Code, please look here.
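For instance, a modest interactive request could look like the following sketch; the partition, QoS, container image, and mount paths are hypothetical placeholders, so substitute whatever your cluster actually provides:

# Hypothetical values: adjust partition, QoS, image, and mounts to your cluster.
srun --ntasks=1 \
--cpus-per-task=8 \
--gpus-per-task=1 \
--mem=32G \
--time=02:00:00 \
--partition=gpu \
--qos=normal \
--job-name=debug-session \
--container-image=/path/to/my_image.sqsh \
--container-mounts=/home/$USER:/workspace \
--pty bash

Once the resources are granted, the shell you land in runs on the allocated node inside the requested container; exiting that shell ends the job and releases the resources.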
Non-interactive jobs, also known as batch jobs, are the most common way to run jobs on a cluster. These jobs are submitted using the sbatch command along with a job script that specifies the job's resources, commands to execute, and any other necessary parameters. For instance, a script named my_job.sh might contain resource requests and commands to run your application. You would submit this script with sbatch my_job.sh. SLURM then queues the job and runs it when resources become available, without requiring further interaction from the user. An example structure for my_job.sh is:
#!/bin/bash
#SBATCH --ntasks=<"number of tasks">
#SBATCH --cpus-per-task=<"CPUs per task">
#SBATCH --gpus-per-task=<"GPUs per task">
#SBATCH --mem=<"memory needed">
#SBATCH --time=<"time limit (e.g., HH:MM:SS)">
#SBATCH --partition=<"partition name">
#SBATCH --qos=<"qos name">
#SBATCH --nodelist=<"node name or names">
#SBATCH --job-name=<"job name">
#SBATCH --output=<"output file path">
#SBATCH --error=<"error file path">
#SBATCH --container-image=<"path to container image">
#SBATCH --container-mounts=<"host path1:container path1,host path2:container path2,...">
#SBATCH --container-writable
#SBATCH --container-remap-root
#SBATCH --container-save=<"path to save container state">
<command to run>
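As a concrete sketch of the template above, a filled-in my_job.sh could look like this; the partition, QoS, image path, and mounted directory are hypothetical, and the training command is just a stand-in for your own application:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gpus-per-task=1
#SBATCH --mem=16G
#SBATCH --time=04:00:00
#SBATCH --partition=gpu
#SBATCH --qos=normal
#SBATCH --job-name=train-model
#SBATCH --output=train-model_%j.out
#SBATCH --error=train-model_%j.err
#SBATCH --container-image=/path/to/my_image.sqsh
#SBATCH --container-mounts=/home/alice:/workspace

# The command below runs inside the container once resources are allocated.
python /workspace/train.py

Submitting it with sbatch my_job.sh returns immediately with a job ID; the %j in the output and error file names is a SLURM filename pattern that is replaced with that ID, so each run gets its own log files.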
Here’s a description of each input used in these commands:
-ntasks=<"number of tasks">
:
-cpus-per-task=<"CPUs per task">
:
-gpus-per-task=<"GPUs per task">
:
-mem=<"memory needed">
:
1024M
) or GB (e.g., 16G
). Ensuring adequate memory is crucial to prevent the job from being killed due to insufficient resources.-time=<"time limit (e.g., HH:MM:SS)">
:
HH:MM:SS
.-partition=<"partition name">
:
-qos=<"qos name">
:
qos
levels may allow access to more resources or faster scheduling.-nodelist=<"node name or names">
:
-job-name=<"job name">
:
-output=<"output file path">
:
-error=<"error file path">
:
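To make the value formats concrete, the following #SBATCH lines show hypothetical settings for memory, time, log files, and container mounts; the paths and names are placeholders, and %j is a SLURM filename pattern replaced with the job ID:

#SBATCH --mem=16G
#SBATCH --time=12:00:00
#SBATCH --output=/home/alice/logs/my_job_%j.out
#SBATCH --error=/home/alice/logs/my_job_%j.err
#SBATCH --container-mounts=/home/alice:/workspace,/datasets:/data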