Slurm Primer
What is Slurm?
Slurm is a job scheduler that manages the resources available in a cluster (i.e., a set of compute resources shared among multiple clients). When multiple clients want to access a powerful but finite set of compute resources at the same time, the Slurm Workload Manager allocates those resources fairly, based on what clients request and what is available at a given time.
The Slurm scheduler provides three key functions:
- it allocates access to resources (compute nodes) to users for some duration of time so they can perform work.
- it provides a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes.
- it arbitrates contention for resources by managing a queue of pending jobs.
Slurm supports a variety of job submission techniques. Slurm will match jobs to appropriate compute resources based on user-specified criteria such as CPUs, GPUs, and memory.
When our clients connect to Carina On-Premise via SSH, they must submit Slurm jobs so their code runs on the resources they requested.
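To get a first look at the cluster's resources and its current workload, two standard Slurm commands are useful (output varies by cluster):
$ sinfo     # list partitions, their time limits, and node states
$ squeue    # list jobs currently pending or running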
Tip: Wait times in queue
As a quick rule of thumb, keep in mind that the more resources your job requests (CPUs, GPUs, memory, nodes, and time), the longer it may have to wait in the queue before it can start.
In other words, accurately requesting resources to match your job’s needs will minimize your wait times.
Components of a Slurm Job
A job consists of two parts: resource requests and job steps.
- Resource requests describe the amount of computing resources (CPUs, GPUs, memory, expected run time, etc.) that the job needs to run successfully.
- Job steps describe the tasks that must be executed (see the sketch below).
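As a minimal sketch (the specific values here are placeholders, not recommendations), both parts appear in a single submission script: the #SBATCH comments are the resource requests, and each srun line is a job step:
#!/bin/bash
#SBATCH --ntasks=1          # resource request: one task
#SBATCH --time=00:10:00     # resource request: ten minutes
srun hostname               # job step 1: print the compute node's name
srun sleep 30               # job step 2: wait 30 seconds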
Slurm and GPUs
To request GPU resources via Slurm, you will need to use Slurm features specific to the GPU resources available on Carina On-Premise. Read on for examples of how these features are used when specifying Slurm jobs with GPUs.
GPU Models and Slurm Features
| # of Nodes | # of GPUs | Slurm Features |
|---|---|---|
| 6 | 4 | GPU_GEN:PSC,GPU_BRD:TESLA,GPU_SKU:V100_PCIE,GPU_MEM:32GB,GPU_CC:7.0 |
| 2 | 2 | GPU_GEN:PSC,GPU_BRD:TESLA,GPU_SKU:P100_PCIE,GPU_MEM:16GB,GPU_CC:6.0,CLOUD |
GPU Slurm Feature Descriptions
| Slurm Feature | Description |
|---|---|
| GPU_GEN | GPU generation |
| GPU_BRD | GPU brand |
| GPU_SKU | GPU model |
| GPU_MEM | Amount of GPU memory |
| GPU_CC | GPU Compute Capability |
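To check which of these features each node actually advertises, you can ask sinfo to print node names alongside their feature lists (standard sinfo output-format options; exact output depends on the cluster):
$ sinfo -N -o "%N %f"   # node name and its available features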
Slurm Interactive Session
It is possible to start an interactive session on a compute node using Slurm. Compute resources are requested and allocated so that you can type your commands one at a time and see their output immediately, which is useful when testing your code.
When requesting CPUs
To start an interactive session on a compute node, with the default resource requirements (one core for 2 hours), you can run:
$ srun --pty bash
You can then see that you have an active session by entering the following:
$ squeue -u <your-sunetid>
#Example
vmeau@carina-login-1:~$ squeue -u vmeau
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
29613 normal bash vmeau R 4:13 1 carina-4
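The defaults can be overridden by passing the same resource flags used in batch scripts directly to srun. For example, a sketch requesting four CPUs, 8 GB of memory, and a four-hour limit (values are illustrative; adjust to your needs):
$ srun --cpus-per-task=4 --mem=8G --time=04:00:00 --pty bash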
When requesting GPUs
The following will request resources for 2 GPUs.
$ srun --pty -p gpu --gres=gpu:2 bash
The following flags are required:
| Slurm flag | Description |
|---|---|
| --pty | gives you a pty (console) |
| -p gpu or --partition=gpu | select the GPU partition |
| --gres=gpu:X | request # of GPUs from 1-4 |
To select a specific GPU model, use the -C (constraint) flag with a Slurm feature, for example:
$ srun --partition=gpu --gres=gpu:1 -C GPU_SKU:V100_PCIE --pty bash
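Once the session starts, you can confirm that the GPUs were actually allocated; nvidia-smi lists the visible devices, and Slurm sets CUDA_VISIBLE_DEVICES for the allocation:
$ nvidia-smi                   # should list the GPU(s) you requested
$ echo $CUDA_VISIBLE_DEVICES   # GPU indices assigned by Slurm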
Slurm Batch Scripts
The typical way of creating a job is to write a job submission script. A submission script is a shell script (e.g., a Bash script) whose first comments, if they are prefixed with #SBATCH, are interpreted by Slurm as parameters describing resource requests and submission options.
The submission script itself is a job step. Other job steps are created with the srun command.
When requesting CPUs
For instance, the following script would request one task with one CPU for 10 minutes, along with 2 GB of memory, in the default partition:
#!/bin/bash
#
#SBATCH --job-name=test
#
# Resource requests: one task with one CPU and 2 GB per CPU, for 10 minutes
#SBATCH --time=10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
# Job steps
srun hostname
srun sleep 60
Warning: Slurm directives must be at the top of the script. Slurm will ignore all #SBATCH directives after the first non-comment line, so always put your #SBATCH parameters at the top of your batch script.
When started, the job runs the first job step, srun hostname, which launches the hostname command on the node where the requested CPU was allocated. Then the second job step starts the sleep command.
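Once the job has finished, each job step shows up as its own line in Slurm's accounting records. A quick way to see them is sacct (which fields are available depends on the cluster's accounting configuration):
$ sacct -j <jobid> --format=JobID,JobName,State,Elapsed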
When requesting GPUs
The following script will request two GPUs for two hours in the gpu partition, with the job name gputest1:
#!/bin/bash
# Give your job a name, so you can recognize it in the queue overview
#SBATCH --job-name=gputest1
# Get email notification when job finishes or fails
#SBATCH --mail-type=END,FAIL # notifications for job done & fail
#SBATCH --mail-user=<sunetid>@stanford.edu
# Define how long your job will run d-hh:mm:ss
#SBATCH --time=02:00:00
# GPU jobs require you to specify the partition
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --mem=16G
# Number of tasks
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
You can also request a specific GPU Slurm feature in your script with lines like the following:
#SBATCH -C GPU_MEM:32GB
#SBATCH -C GPU_SKU:V100_PCIE
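If you need both constraints at once, the -C/--constraint syntax also supports AND (&) and OR (|) operators, so the two lines above could be combined into one (check man sbatch for the exact syntax your Slurm version supports):
#SBATCH -C "GPU_MEM:32GB&GPU_SKU:V100_PCIE"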
You can create this job submission script on Carina On-Premise using a text editor such as nano or vim, and save it as submit.sh.
Job Submission
Once the submission script is written properly, you can submit it to the scheduler with the sbatch command. Upon success, sbatch will return the ID it has assigned to the job (the jobid).
$ sbatch submit.sh
Submitted batch job 1377
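If you submit jobs from a script, sbatch's --parsable flag prints only the job ID, which is handy to capture in a shell variable (a small sketch):
$ jobid=$(sbatch --parsable submit.sh)
$ echo "Submitted job $jobid"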
Check the status of your job
Once submitted, the job enters the queue in the PENDING state. When resources become available and the job has sufficient priority, an allocation is created for it and it moves to the RUNNING state. If the job completes correctly, it goes to the COMPLETED state; otherwise, its state is set to FAILED.
You'll be able to check the status of your job and follow its evolution with the squeue -u <your-sunetID> command:
$ squeue -u vmeau
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1377 normal test vmeau R 0:12 1 slurm-gpu-compute-7t8jf
The Slurm scheduler automatically creates an output file containing the results of the commands run in the script. That output file is named slurm-<jobid>.out by default, but can be customized via submission options (see the tip below). In the above example, you can list the contents of that output file with the following command:
$ cat slurm-1377.out
slurm-gpu-compute-7t8jf
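While the job is still running, you can also follow the output file as it is being written:
$ tail -f slurm-1377.out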
Congratulations, you’ve submitted your first batch job on Carina!
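Tip: instead of the default slurm-<jobid>.out name, you can choose your own output file with the --output directive; the %j pattern expands to the job ID:
#SBATCH --output=gputest1-%j.out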
Check Overall Utilization
You can quickly see the resources you've used across Slurm for a given time period. Use the following to see your CPU, memory, and GPU utilization statistics. This example returns information for the month of November 2021:
$ sreport cluster UserUtilizationByAccount -T GRES/gpu,cpu,Mem Start=2021-11-01T00:00:00 End=2021-11-30T23:59:59 user=<your-SUNetID>
Replace the Start and End values if you want to change the time range, and make sure to replace <your-SUNetID> with your own Stanford SUNetID.
Check the GPU Utilization of your job
To check GPU utilization while your job is running, run nvidia-smi inside the job's allocation (replace $RUNNINGJOB with the job ID of your running job):
$ srun --jobid=$RUNNINGJOB --pty nvidia-smi
What's next?
Actually, quite a lot. Although you now know how to submit a simple batch job, there are many other options. You can get the complete list of parameters by referring to the sbatch manual page (man sbatch).
Our team has also put together more extensive Slurm trainings that will allow you to use Slurm efficiently. Look for our workshop named "Efficient Use of High Performance Computing Resources" to get a more thorough introduction to Slurm.