Display the current utilization of resources

Idle vs. allocated resources

To display the list of idle nodes:

sinfo --state=idle

CPU nodes

sinfo --Format Partition,NodeList,NodeAI,CPUsState -p cpucourt,cpulong,smp,visu

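In the CPUS(A/I/O/T) column, the four numbers are CPU core counts: A=allocated, I=idle, O=other, T=total. The NODES(A/I) column gives the number of allocated and idle nodes.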

GPU nodes

sinfo -NO "CPUsState:30,Gres:30,GresUsed:30,NodeList:30" -p gpu

Example:

[user@login-hpc ~]# sinfo -NO "CPUsState:30,Gres:30,GresUsed:30,NodeList:30" -p gpu
CPUS(A/I/O/T)                 GRES                          GRES_USED                     NODELIST
0/32/0/32                     gpu:v100:4(S:0-1)             gpu:v100:0(IDX:N/A),mic:0     gpu01
18/14/0/32                    gpu:v100:4(S:0-1)             gpu:v100:1(IDX:3),mic:0       gpu02
47/5/0/52                     gpu:a100:4(S:0-1)             gpu:a100:2(IDX:0-1),mic:0     gpu03

This shows that one of the four V100 cards on node gpu02 is in use and that gpu01 is completely idle. On gpu03, 2 of the 4 A100 cards and 47 of the 52 CPU cores are in use, so only 5 cores are idle. A job that requests 1 GPU card and 10 CPU cores on gpu03 will therefore stay pending until at least 5 more CPU cores are released.
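The arithmetic behind the gpu03 case can be sketched as a small shell function. This is only an illustration: the function name cores_short is hypothetical (it is not a Slurm command), and it looks at CPU cores only, not at GPUs, memory or other limits that can also keep a job pending.

```shell
# cores_short STATE NEEDED   (hypothetical helper, not part of Slurm)
# STATE is a CPUS(A/I/O/T) value such as "47/5/0/52"; NEEDED is the number
# of CPU cores the job requests. Prints how many more cores must be
# released before the request fits (0 means it already fits).
cores_short() {
    state=$1
    needed=$2
    # The second slash-separated field is the idle core count.
    idle=$(printf '%s\n' "$state" | cut -d/ -f2)
    if [ "$idle" -ge "$needed" ]; then
        echo 0
    else
        echo $((needed - idle))
    fi
}

cores_short 47/5/0/52 10    # gpu03 in the example: prints 5
cores_short 18/14/0/32 10   # gpu02 in the example: prints 0
```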

Current CPU load for every node

sinfo --Format NodeHost,CPUsState,CPUsLoad -p cpucourt,cpulong,smp,gpu,visu

A=allocated, I=idle, O=other, T=total.

The load is normal as long as it is equal to or lower than the number of allocated cores on the node.
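The rule above can be applied automatically with a short filter. The sketch below assumes input lines of the form "HOST A/I/O/T LOAD" separated by whitespace, which is what the command above should produce when --noheader is added. The function name flag_overloaded and the sample values are made up for illustration.

```shell
# flag_overloaded (hypothetical helper): reads "HOST A/I/O/T LOAD" lines on
# stdin and prints the hosts whose load is higher than their number of
# allocated cores.
flag_overloaded() {
    awk '{
        split($2, cpus, "/")         # cpus[1] is the allocated core count
        if ($3 + 0 > cpus[1] + 0)    # "+ 0" forces a numeric comparison
            print $1
    }'
}

# Made-up sample lines standing in for real sinfo output:
printf '%s\n' 'node01 16/16/0/32 15.80' 'node02 8/24/0/32 12.40' | flag_overloaded
# On the cluster, something like:
#   sinfo --noheader --Format NodeHost,CPUsState,CPUsLoad -p cpucourt,cpulong | flag_overloaded
```

With the sample lines, only node02 is printed, since its load of 12.40 exceeds its 8 allocated cores. A non-numeric load value such as N/A is treated as 0 and is never flagged.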

Nodes information

Retrieve information about all the nodes and their current load:

scontrol show nodes

Retrieve information about a particular node (here compute01):

scontrol show node compute01
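The output of scontrol show node is a series of Key=Value pairs, so a single value can be pulled out with standard text tools. The following is a minimal sketch: node_field is a hypothetical helper name, the sample text uses made-up values, and values that contain spaces are cut at the first space.

```shell
# node_field KEY (hypothetical helper): reads `scontrol show node` output on
# stdin and prints the value of the first KEY=value pair it finds.
node_field() {
    # Put each Key=Value pair on its own line, then strip the "KEY=" prefix.
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Made-up sample standing in for real scontrol output:
printf '%s\n' 'NodeName=compute01 Arch=x86_64 CoresPerSocket=16' \
              '   CPUAlloc=18 CPUTot=32 CPULoad=17.95' | node_field CPULoad
# On the cluster, something like:
#   scontrol show node compute01 | node_field CPULoad
```

With the sample text this prints 17.95.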