In this article I list the metrics and alerts one should have in place when monitoring a GPU cluster to ensure it is used efficiently; a minimal sketch of how some of these checks can be computed follows the list.
- Allocated GPUs are used
- Used to detect jobs that request multiple GPUs but end up using only one or a few of them
- GPU utilization below threshold (e.g., 10%)
- Used to detect workloads that do not make full use of the GPU or are allocated to an oversized GPU
- GPU utilization above threshold (e.g., 90%)
- Used to detect when the GPU is saturated
- GPU utilization range
- Used to detect an uneven distribution of the compute workload across a job's GPUs
- GPU memory utilization below threshold (e.g., 10%)
- Used to detect workloads that do not make full use of the GPU memory or are allocated to an oversized GPU
- GPU memory utilization above threshold (e.g., 95%)
- Used to detect when a job is about to run out of GPU memory
- If using InfiniBand
- InfiniBand receive/transmit throughput > 0 when running multi-node workloads
- Used to identify workloads that are not properly configured to use InfiniBand
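
To make these alerts concrete, here is a minimal node-local sketch of some of the checks, assuming the nvidia-ml-py package (imported as pynvml) and InfiniBand counters exposed under /sys/class/infiniband. The thresholds mirror the illustrative values above; the 50% utilization-spread cutoff is an assumption of this sketch, not a recommendation.

```python
# Minimal sketch of node-local GPU/InfiniBand checks (assumes nvidia-ml-py).
from pathlib import Path

import pynvml

GPU_UTIL_LOW = 10      # %
GPU_UTIL_HIGH = 90     # %
GPU_MEM_LOW = 10       # %
GPU_MEM_HIGH = 95      # %
UTIL_SPREAD_MAX = 50   # % difference between most and least busy GPU (assumption)


def check_gpus() -> None:
    pynvml.nvmlInit()
    try:
        utils = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            mem_pct = 100 * mem.used / mem.total
            utils.append(util)
            if util < GPU_UTIL_LOW:
                print(f"GPU {i}: utilization {util}% < {GPU_UTIL_LOW}% (underused or oversized GPU)")
            elif util > GPU_UTIL_HIGH:
                print(f"GPU {i}: utilization {util}% > {GPU_UTIL_HIGH}% (saturated)")
            if mem_pct < GPU_MEM_LOW:
                print(f"GPU {i}: memory {mem_pct:.0f}% < {GPU_MEM_LOW}% (underused or oversized GPU)")
            elif mem_pct > GPU_MEM_HIGH:
                print(f"GPU {i}: memory {mem_pct:.0f}% > {GPU_MEM_HIGH}% (about to run out of memory)")
        # A large spread between the most and least busy GPU suggests the
        # compute workload is not evenly distributed across the GPUs.
        if utils and max(utils) - min(utils) > UTIL_SPREAD_MAX:
            print(f"GPU utilization spread on this node is {max(utils) - min(utils)}%")
    finally:
        pynvml.nvmlShutdown()


def check_infiniband() -> None:
    # port_rcv_data / port_xmit_data are cumulative counters (typically in
    # 4-byte units); for a multi-node workload they should grow between two
    # samples, otherwise the job is likely not using InfiniBand.
    for counter in Path("/sys/class/infiniband").glob("*/ports/*/counters/port_*_data"):
        print(f"{counter}: {int(counter.read_text())}")


if __name__ == "__main__":
    check_gpus()
    check_infiniband()
```

In a real cluster these values typically come from NVIDIA's DCGM exporter scraped by Prometheus, with the comparisons expressed as alerting rules rather than a script, but the checks themselves are the same.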
If the behavior is chronic and unproductive, decide how much engagement is worthwhile.
Sometimes planting a seed is better than trying to change their mind in the moment.
Some people are resistant to new perspectives. If they refuse to engage, focus on managing your own reaction rather than changing theirs.
Choose your battles: not every conversation is worth having, so consider whether it is worth investing time and energy in trying to change someone's mind.
- Model Context Protocol (MCP)
- dstack
- Nebius
- Multi-node training and inference
- DeepSpeed
- CometML
- How to use LLMs more in my development process
- ML model profiling and optimization
- GCP GCS performance profiling
- NVIDIA MPS
- NVIDIA MIG
- NVIDIA time-slicing
- NVIDIA vGPU
- NVIDIA KAI scheduler
Here is the list of Python tools and libraries I use regularly.
(Sorted alphabetically)