Idle vs. allocated resources
To display the list of idle nodes:
sinfo --state=idle
CPU nodes
sinfo --Format Partition,NodeList,NodeAI,CPUsState -p cpucourt,cpulong,smp,visu
A=allocated, I=idle, O=other, T=total.
GPU nodes
sinfo -NO "CPUsState:30,Gres:30,GresUsed:30,NodeList:30" -p gpu
Example:
[user@login-hpc ~]# sinfo -NO "CPUsState:30,Gres:30,GresUsed:30,NodeList:30" -p gpu
CPUS(A/I/O/T)                 GRES                          GRES_USED                     NODELIST
0/32/0/32                     gpu:v100:4(S:0-1)             gpu:v100:0(IDX:N/A),mic:0     gpu01
18/14/0/32                    gpu:v100:4(S:0-1)             gpu:v100:1(IDX:3),mic:0       gpu02
47/5/0/52                     gpu:a100:4(S:0-1)             gpu:a100:2(IDX:0-1),mic:0     gpu03
This shows that one of the four V100 cards on gpu02 is in use. gpu01 is idle, while 2 A100 cards and 47 CPU cores are in use on gpu03. As a result, a job that requests 1 GPU card and 10 CPU cores on gpu03 will stay pending until at least 5 more CPU cores are released, because only 5 of its 52 cores are currently idle.
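The arithmetic behind that last statement can be sketched in a few lines of shell. This is only an illustration: the function name cores_to_wait_for is hypothetical (it is not a Slurm command), and it assumes the CPUS(A/I/O/T) value is passed exactly as sinfo prints it.

```shell
# Hypothetical helper (not part of Slurm): number of additional cores that
# must become idle on a node before a job requesting $2 cores can start there.
#   $1 = CPUS(A/I/O/T) value as printed by sinfo, e.g. 47/5/0/52
#   $2 = number of CPU cores requested by the job
cores_to_wait_for() {
    cpus_state=$1
    requested=$2
    # The second field of A/I/O/T is the idle core count.
    idle=$(printf '%s\n' "$cpus_state" | cut -d/ -f2)
    missing=$((requested - idle))
    # A negative value means the node already has enough idle cores.
    if [ "$missing" -lt 0 ]; then
        missing=0
    fi
    printf '%s\n' "$missing"
}

cores_to_wait_for 47/5/0/52 10    # gpu03 in the example: prints 5
cores_to_wait_for 18/14/0/32 10   # gpu02 in the example: prints 0
```

The sketch only looks at CPU cores. Whether the job actually starts also depends on free GPU cards, memory, and the scheduler's priorities.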
Current CPU load for every node
sinfo --Format NodeHost,CPUsState,CPUsLoad -p cpucourt,cpulong,smp,gpu,visu
A=allocated, I=idle, O=other, T=total.
The load is normal as long as it is equal to or lower than the number of allocated cores on the node.
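As a rough sketch, the rule above can be checked for a single node with a small shell function. The name load_is_normal is hypothetical, and the function assumes it receives the CPUS(A/I/O/T) and CPU load values as printed by sinfo. A non-numeric load (such as N/A) is treated as 0 here, which is a simplification.

```shell
# Hypothetical helper (not part of Slurm): succeeds (exit status 0) when the
# reported load does not exceed the number of allocated cores.
#   $1 = CPUS(A/I/O/T) value, e.g. 18/14/0/32
#   $2 = CPU load of the node, e.g. 17.62
load_is_normal() {
    awk -v state="$1" -v load="$2" 'BEGIN {
        split(state, f, "/")           # f[1] is the allocated core count
        exit !(load + 0 <= f[1] + 0)   # exit 0 if normal, 1 otherwise
    }'
}

if load_is_normal 18/14/0/32 17.62; then echo "normal"; else echo "overloaded"; fi   # prints normal
if load_is_normal 18/14/0/32 25.40; then echo "normal"; else echo "overloaded"; fi   # prints overloaded
```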
Nodes information
Retrieve information about all the nodes and their current load:
scontrol show nodes
Information about a particular node (compute01):
scontrol show node compute01