How to estimate the resources required to run your job

Avoid unnecessary resource reservations

Some commands can help you size your reservation of resources (cores, memory and walltime) appropriately. The benefit is twofold. First, when few resources are free on the cluster, a job that requests few resources is likely to start sooner than a job that requests many. Second, since your project is allocated a given volume of CPU hours per year, these commands help you notice when you are wasting resources.

You can use the "Slurm job efficiency report" (seff), which reports how efficiently a job used its CPU and memory. Run the command below once the first execution of the job has completed:

seff <JOBID>

For example, if a job reserved 2 nodes and seff reports a CPU Efficiency lower than 50%, one node would have been enough.
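As a rough illustration of what the CPU efficiency figure means, the sketch below recomputes it as TotalCPU / (Elapsed × allocated cores). The helper names to_seconds and cpu_efficiency are hypothetical, and the parsing assumes durations formatted as [DD-]HH:MM:SS or MM:SS.mmm; treat the result as an approximation and compare it with what seff reports on your cluster.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Slurm.

# Convert a duration such as 1-02:03:04, 02:03:04 or 03:04.500 to whole seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        # Fields are read from the right: seconds, minutes, hours, days.
        mult[0] = 1; mult[1] = 60; mult[2] = 3600; mult[3] = 86400
        total = 0
        for (i = 0; i < NF; i++) total += $(NF - i) * mult[i]
        printf "%d\n", total
    }'
}

# cpu_efficiency <TotalCPU> <Elapsed> <allocated cores>
# Prints the percentage of the reserved core time that was actually used.
cpu_efficiency() {
    cpu=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    awk -v c="$cpu" -v w="$wall" -v n="$3" 'BEGIN {
        if (w * n > 0) printf "%.1f\n", 100 * c / (w * n)
        else print "n/a"
    }'
}

# 4 hours of CPU time over 1 hour of walltime on 8 cores: prints 50.0
cpu_efficiency 04:00:00 01:00:00 8
```

A value well below 100 suggests that fewer cores (or nodes) could be requested for the next run.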

The following command shows the memory consumption and the elapsed time for each step of the job, so that you can adjust these parameters for your next runs.

sacct -j <JOBID> --format=jobid,jobname,reqnodes,reqcpus,reqmem,maxrss,averss,elapsed,TotalCPU

To see this information for all your jobs since a given date (for example in YYYY-MM-DD format), use this command:

sacct -S <START_DATE> --format=jobid,jobname,reqnodes,reqcpus,reqmem,maxrss,averss,elapsed,cputime,time,start,end -u <USERNAME>

The complete list of columns that can be shown is available in the sacct documentation, or by running sacct --helpformat.

How to set the amount of memory to be reserved

To set the amount of memory your job needs, use one of the following Slurm options (they are mutually exclusive):

--mem=<size>            (memory per node)
--mem-per-cpu=<size>    (memory per allocated CPU)
--mem-per-gpu=<size>    (memory per allocated GPU)

By default, the specified value is interpreted in megabytes, but you can change the unit by appending one of the suffixes [K|M|G|T] to it.
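One possible way to turn a measured memory peak into a reservation is sketched below. The helper suggest_mem is hypothetical: it takes a value with an optional K, M, G or T suffix (no suffix is read as megabytes, as for --mem), adds a safety margin, and prints a --mem option in megabytes. The 20% default margin is an arbitrary assumption, not a site recommendation, and the sketch assumes the peak is a per-node figure.

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm.
# suggest_mem <peak memory, e.g. 2097152K> [margin in percent, default 20]
suggest_mem() {
    awk -v v="$1" -v margin="${2:-20}" 'BEGIN {
        unit = toupper(substr(v, length(v)))   # last character: unit suffix, if any
        num = v + 0                            # awk keeps the leading numeric part
        if (unit == "K")      mb = num / 1024
        else if (unit == "G") mb = num * 1024
        else if (unit == "T") mb = num * 1024 * 1024
        else                  mb = num         # "M" or no suffix: megabytes
        mb = mb * (1 + margin / 100)           # add headroom
        r = int(mb); if (r < mb) r++           # round up to a whole megabyte
        printf "--mem=%dM\n", r
    }'
}

# A peak of 2097152K (2048 MB) plus 20% headroom: prints --mem=2458M
suggest_mem 2097152K
```

The printed option can then be passed to sbatch on the command line or written in the batch script as a #SBATCH line.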

I increased the number of cores but performance fails to improve

Although running code in parallel can save significant time compared to running it on a single core, you may notice that the speedup is not proportional to the number of cores used. The sequential (non-parallelizable) portions of your code do not benefit from additional cores. Consequently, beyond a certain number of cores, which depends on your code, the speedup reaches its ceiling and running the code on more resources is pointless. For more information on this subject, see Amdahl’s law.
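To make this ceiling concrete, here is a small sketch of Amdahl's law, S(N) = 1 / ((1 - p) + p / N), where p is the parallelizable fraction of the run time and N the number of cores. The function name amdahl_speedup and the value p = 0.9 are illustrative assumptions; the real fraction depends on your code and has to be measured.

```shell
#!/bin/sh
# Hypothetical helper illustrating Amdahl's law.
# amdahl_speedup <parallel fraction p, between 0 and 1> <number of cores N>
amdahl_speedup() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

# Theoretical speedup of a code that is 90% parallelizable.
for n in 1 2 4 8 16 64 1024; do
    printf "%5d cores: speedup x%s\n" "$n" "$(amdahl_speedup 0.9 "$n")"
done
```

With p = 0.9 the speedup can never exceed 1 / (1 - p) = 10, however many cores are reserved, so the extra cores mostly consume CPU hours without shortening the run.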