Avoid unnecessary resource reservations
Some commands can help you size your resource reservation (cores, memory and walltime) appropriately. This matters because, when few resources are free on the cluster, a job that requests few resources is likely to start sooner than a job that requests many. Moreover, since your project is allocated a fixed volume of CPU hours per year, checking your actual usage shows you whether you are wasting resources.
You can use the "Slurm job efficiency report" (seff), which reports how efficiently a job used its CPU and memory. Run the command below once the first execution of the job is complete:
seff <JOBID>
The following command shows the memory consumption and the elapsed time for each step of the job, so that you can adjust these parameters for your next runs.
sacct -j <JOBID> --format=jobid,jobname,reqnodes,reqcpus,reqmem,maxrss,averss,elapsed,TotalCPU
If you want to see the information for all your jobs, use this command:
sacct -S <START_DATE> --format=jobid,jobname,reqnodes,reqcpus,reqmem,maxrss,averss,elapsed,cputime,time,start,end -u <USERNAME>
The complete list of columns that can be shown is available in the sacct documentation, or by running sacct --helpformat.
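As an illustration of how the MaxRSS figure can feed back into your next reservation, here is a minimal sketch of a shell helper that converts a MaxRSS value into a memory request with some headroom. The function name suggest_mem, the 20 % default margin and the treatment of unsuffixed values as bytes are assumptions made for this example; they are not part of Slurm.

```shell
# Hypothetical helper (not a Slurm command): turn a MaxRSS value as printed
# by sacct (for example "1523456K") into a memory request in megabytes.
#   $1 = MaxRSS value, optionally suffixed with K, M, G or T
#   $2 = safety margin in percent (assumption: defaults to 20)
suggest_mem() {
    local maxrss="$1"
    local margin_pct="${2:-20}"
    local number="${maxrss%[KMGT]}"     # numeric part, suffix stripped
    local unit="${maxrss#"$number"}"    # the suffix, or empty if none
    awk -v n="$number" -v u="$unit" -v m="$margin_pct" 'BEGIN {
        # Convert to megabytes using binary multiples (1 M = 1024 K)
        if (u == "K")      mb = n / 1024
        else if (u == "M") mb = n
        else if (u == "G") mb = n * 1024
        else if (u == "T") mb = n * 1024 * 1024
        else               mb = n / (1024 * 1024)  # assumption: bytes
        mb = mb * (1 + m / 100)
        # Round up to a whole megabyte
        r = int(mb)
        if (r < mb) r++
        printf "%dM\n", r
    }'
}
```

For instance, suggest_mem 1048576K prints 1229M (1024 MB plus 20 %), a value that could then be passed to --mem for the next run. Use the largest MaxRSS among the job's steps as input, and adapt the margin to how much the memory footprint varies between runs.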
How to set the amount of memory to be reserved
To set the amount of memory necessary for your job, the Slurm options available to you are as follows:
--mem=
--mem-per-cpu=
--mem-per-gpu=
By default, the specified value is interpreted in megabytes, but you can change the unit by appending one of the suffixes K, M, G or T.
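The options above can be placed in a batch script as follows. This is a minimal sketch: the job name, the resource values and the program name are placeholders chosen for the example, and your site may require additional options such as a partition or an account.

```shell
#!/bin/bash
#SBATCH --job-name=mem_example   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G         # 4 CPUs x 2 GB = 8 GB in total
#SBATCH --time=01:00:00

# Alternative: request the total memory per node instead, for example
# with "--mem=8G". The --mem, --mem-per-cpu and --mem-per-gpu options
# are mutually exclusive, so use only one of them in a given job.

# Inside a job, Slurm exports the per-CPU request in SLURM_MEM_PER_CPU;
# outside a job the variable is not set.
echo "Memory requested per CPU: ${SLURM_MEM_PER_CPU:-unset}"

# srun ./my_program              # hypothetical executable
```

Requesting memory per CPU keeps the total proportional to the core count if you later change --cpus-per-task, whereas --mem fixes the total per node regardless of the number of cores.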
I increased the number of cores but performance fails to improve
Although running code in parallel can save significant time compared with running it on a single core, you may notice that execution speed does not increase in proportion to the number of cores used. The sequential (non-parallelizable) portions of your code do not benefit from additional cores. Thus, depending on your code, beyond a certain number of cores the speedup reaches its maximum, and running the code on more resources brings no further gain. For more information on this subject, see Amdahl's law.
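To make the plateau concrete, the sketch below evaluates Amdahl's law, speedup = 1 / ((1 - p) + p / N), where p is the parallelizable fraction of the run time and N the number of cores. The function name amdahl_speedup and the value p = 0.9 are assumptions chosen for illustration; the parallel fraction of your own code has to be measured.

```shell
# Hypothetical helper: theoretical speedup according to Amdahl's law.
#   $1 = parallelizable fraction of the run time (between 0 and 1)
#   $2 = number of cores
amdahl_speedup() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 1 / ((1 - p) + p / n) }'
}

# Print the theoretical speedup for a code that is 90 % parallelizable
for cores in 1 2 4 8 16 32 64; do
    printf '%3d cores: speedup x%s\n' "$cores" "$(amdahl_speedup 0.9 "$cores")"
done
```

With p = 0.9 the speedup can never exceed 10, whatever the number of cores: 8 cores give about 4.71 and 64 cores only about 8.77, so the last 56 cores bring less than a factor of two.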