Introduction

High-Performance Computing (HPC) is an essential tool for modern scientific research, allowing scientists and researchers to perform complex calculations and simulations at an unprecedented scale.

One of the most popular HPC job scheduling systems is SLURM, which stands for Simple Linux Utility for Resource Management.

In this blog post, we will focus on the salloc command, one of the most commonly used SLURM commands for requesting compute resources.

The salloc command

The salloc command allows users to request compute resources for interactive jobs.

This command reserves resources on the HPC cluster for a specified amount of time and assigns them to the user for the duration of the allocation. This can be useful for debugging or testing code, or for running jobs that require user interaction or visualizations.

Here’s an example of how to use the salloc command:

salloc -N1 -p dlv -A project_a -t 120

Let’s break down each of these options in more detail:

  • -N1: This option specifies the number of nodes requested for the job. In this case, we are requesting only one node.
  • -p dlv: This option specifies the partition on which the job will run. Here, we are using the partition named “dlv”, which corresponds to a specific computing resource such as “2xV100 GPUs” (see the sketch after this list for requesting those GPUs explicitly).
  • -A project_a: This option specifies the account or project to which the job belongs.
  • -t 120: This option specifies the duration of the job in minutes. In this example, we are requesting resources for 120 minutes.
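
If the partition exposes its GPUs as a generic resource (which depends on how the cluster is configured), they can also be requested explicitly. A minimal sketch, assuming the standard --gres plugin and a resource named “gpu”; the count of 2 matches the 2xV100 example above:

salloc -N1 -p dlv -A project_a -t 120 --gres=gpu:2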

Once we run the salloc command, we will see output similar to the example shown in the Accessing Resources section below.

The srun command

Another command similar to salloc is srun:

srun -N 1 -p partition_name -t 02:00:00 --pty bash
  • srun: This is the command used to submit and launch parallel job steps.

  • -N 1: This option specifies the number of nodes to be allocated for the job. In this case, it’s requesting 1 node.

  • -p partition_name: The “-p” option specifies the partition or queue to which the job should be submitted. In this case, the partition is specified as “partition_name”.

  • -t 02:00:00: The “-t” option specifies the maximum time (in HH:MM:SS format) the job is allowed to run. In this case, it’s set to 2 hours.

  • --pty bash: This part of the command requests a pseudo-terminal (pty) and launches the “bash” shell. The “--pty” option indicates that the job requires a pseudo-terminal for interactive use, and “bash” is the shell that will be started.
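
Besides starting an interactive shell, srun can also launch a command directly across the allocated nodes. A minimal sketch, using the standard -n flag for the number of tasks (the partition name is a placeholder as before):

srun -N 2 -n 8 -p partition_name -t 00:10:00 hostname

This runs hostname as 8 tasks spread over the 2 nodes, which is a quick way to confirm which machines an allocation landed on.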

The sbatch command

The classic way to run non-interactive work is to submit a batch script:

sbatch script.sh

where the script looks something like this:

#!/bin/bash
#SBATCH -p gpu
#SBATCH -t 24:00:00

# Move to the working directory for this job
cd /main/working/dir

# Load Anaconda and activate the project environment
module load modules/anaconda3/4.3.1
conda activate env

# Point the various caches at a data directory with enough space
export data_path="/some/data/path"
export HUGGINGFACE_HUB_CACHE=$data_path
export PIP_CACHE_DIR=$data_path
export TRANSFORMERS_CACHE=$data_path

# Run the actual workload
python script.py
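
Submitting the script is then a single command, and Slurm replies with the ID of the queued job (the exact number will differ, of course):

sbatch script.sh
Submitted batch job 123456

The job then waits in the queue and runs without any further interaction, writing its output to a slurm-<jobid>.out file by default.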

Accessing Resources

When the salloc request from earlier is granted, the command prints output similar to the following:
salloc: Granted job allocation 123456
salloc: Waiting for resource configuration
salloc: Nodes cn001 are ready for job

Here’s what this means:

  • The first line indicates that the job allocation has been granted, and it provides a job ID (in this case, 123456).
  • The second line indicates that the system is waiting for the resource configuration to be set up.
  • The third line indicates that the requested node (in this case, cn001) is ready for the job.

At this point, we can log into the allocated node using the ssh command and run our interactive job. Once we are finished, we can exit the node and the resources will be released automatically.
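
For example, assuming the cluster permits direct SSH into allocated nodes (some sites require going through srun instead), a session reusing the node name from the output above might look like this:

ssh cn001    # connect to the node granted by salloc
# ... run interactive work here ...
exit         # log out of the compute node
exit         # leave the salloc shell, which releases the allocation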

Monitoring Resources

Another helpful command is squeue, which allows users to view the status of their jobs, as well as the jobs of other users on the cluster.

By default, squeue shows all running and pending jobs on the cluster. However, users can use various options to filter the results by user, job ID, partition, or other criteria.

To view the status of your own jobs, you can use the -u option followed by your username. For example:

squeue -u username

This results in an output that may look similar to the one below:

JOBID     PARTITION     NAME         USER        ST   TIME_LEFT  NODES
123456    gpu           job1         johndoe     R    02:30:00   1
123457    gpu           job2         johndoe     PD   00:05:00   2
123458    cpu           job3         johndoe     R    00:45:00   1
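
Beyond filtering by user, a few other commonly used filters are shown below (the flags are standard squeue options; the partition name and job ID are placeholders):

squeue -p gpu        # only jobs in the "gpu" partition
squeue -j 123456     # a single job, looked up by job ID
squeue -t PENDING    # only jobs that are still waiting for resources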

Viewing available compute nodes

To see the partitions and nodes available on the cluster, along with their current state, just type:

sinfo
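
By default, sinfo prints one line per partition and node state, including how many nodes are in each state and the partition time limits. A couple of useful variations (these are standard sinfo flags; the partition name is a placeholder):

sinfo -p gpu     # limit the listing to a single partition
sinfo -N -l      # one line per node, in a longer format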