Optimal way to monitor GPU memory usage during distributed training (XGBoost)

Hello,

I was wondering if anyone has advice, or information they could direct me to, on ways to monitor GPU memory usage during distributed training using logger.info().

I’ve been trying to implement GPU memory monitoring with PyNVML and XGBoost callbacks during the training loop, but I’m running into an issue: the memory values reported remain exactly the same, down to the byte, throughout the entire training process.

For example, it reports something like 7000 MB / 8000 MB available for all four GPUs every 10 iterations, without a single byte of change.

Any advice on why the GPU memory values stay static during distributed training would be greatly appreciated. Is there a better approach to monitoring GPU memory usage in a distributed XGBoost/Dask setup?

Simplified Training Setup

import dask.array as da
import xgboost as xgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

with LocalCUDACluster(n_workers=n_gpus, device_memory_limit="8GB") as cluster:
    with Client(cluster) as client:
        # Convert the in-memory training data to Dask arrays
        X_train_da = da.from_array(X_train, chunks=(chunk_size, -1))
        y_train_da = da.from_array(y_train, chunks=chunk_size)

        # Create DMatrix objects for training and validation
        dtrain = xgb.dask.DaskDMatrix(client, X_train_da, y_train_da)
        dval = xgb.dask.DaskDMatrix(client, X_val_da, y_val_da)

        # Train with XGBoost
        output = xgb.dask.train(
            client,
            params,
            dtrain,
            num_boost_round=1000,
            evals=[(dtrain, "train"), (dval, "val")],
            verbose_eval=True
        )

XGBoost Callback Implementation

class GPUMemoryCallback(xgb.callback.TrainingCallback):
    def __init__(self, logger, log_interval=10):
        self.logger = logger
        self.log_interval = log_interval

    def after_iteration(self, model, epoch, evals_log):
        if (epoch + 1) % self.log_interval == 0:
            # This doesn't work well in distributed setting
            log_gpu_memory_usage(self.logger, stage=f"training_iteration_{epoch + 1}")
        return False  # False means "do not stop training early"
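
To narrow down where the callback actually executes (it gets serialized and sent to the workers, as noted in the full training function below), I have been thinking about tagging each reading with the hostname and PID of the calling process. A rough sketch of what I mean, with TaggedGPUMemoryCallback as a placeholder name:

import os
import socket

class TaggedGPUMemoryCallback(xgb.callback.TrainingCallback):
    def __init__(self, logger, log_interval=10):
        self.logger = logger
        self.log_interval = log_interval

    def after_iteration(self, model, epoch, evals_log):
        if (epoch + 1) % self.log_interval == 0:
            # The stage string now records which host/process made the reading
            tag = f"{socket.gethostname()}:{os.getpid()}"
            log_gpu_memory_usage(self.logger, stage=f"{tag}_iteration_{epoch + 1}")
        return False  # False means "do not stop training early"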

Full Training Function with Callback

def fit(self, train_df, val_df, compute_resources):
    # ... preprocessing code ...
    
    try:
        with LocalCUDACluster(n_workers=n_gpus, device_memory_limit="8GB") as cluster:
            with Client(cluster) as client:
                # Create Dask arrays and DMatrix objects
                dtrain = xgb.dask.DaskDMatrix(client, X_train_da, y_train_da)
                dval = xgb.dask.DaskDMatrix(client, X_val_da, y_val_da)
                
                # Prepare callbacks - THIS IS THE KEY PART
                train_callbacks = []
                if self._is_using_gpu():
                    gpu_callback = GPUMemoryCallback(logger, log_interval=10)
                    train_callbacks.append(gpu_callback)
                
                # Train with XGBoost - callback gets serialized and sent to workers
                output = xgb.dask.train(
                    client,
                    self.params,
                    dtrain,
                    num_boost_round=1000,
                    early_stopping_rounds=50,
                    evals=[(dtrain, "train"), (dval, "val")],
                    callbacks=train_callbacks,  # <-- Callback used here
                    verbose_eval=True
                )

GPU Memory Logging Function

from pynvml import (
    NVMLError,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    nvmlDeviceGetName,
)

def log_gpu_memory_usage(logger, stage):
    # NVML is assumed to be initialized (nvmlInit()) elsewhere in the process
    gpu_info = {}
    try:
        device_count = nvmlDeviceGetCount()
        gpu_info["gpu_count"] = device_count

        for i in range(device_count):
            handle = nvmlDeviceGetHandleByIndex(i)
            memory_info = nvmlDeviceGetMemoryInfo(handle)
            name = nvmlDeviceGetName(handle).decode("utf-8")

            gpu_info[f"gpu_{i}"] = {
                "name": name,
                "memory_total_mb": memory_info.total / (1024 * 1024),
                "memory_used_mb": memory_info.used / (1024 * 1024),
                "memory_free_mb": memory_info.free / (1024 * 1024),
                "memory_usage_percent": (memory_info.used / memory_info.total) * 100,
            }

        # Log one summary line per call, matching the format in the logs further down
        percents = [gpu_info[f"gpu_{i}"]["memory_usage_percent"] for i in range(device_count)]
        summary = " | ".join(f"GPU{i}: {p:.1f}%" for i, p in enumerate(percents))
        logger.info("[%s] GPU Memory Usage: %s", stage, summary)
    except NVMLError as err:
        logger.warning("[%s] Could not read GPU memory via NVML: %s", stage, err)
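
As a cross-check, I could also run the same kind of query on every Dask worker directly with Client.run, outside of the XGBoost callback entirely, so that each worker reports its own NVML view. A rough sketch of what I mean (query_gpu_memory_on_worker is just a placeholder name, not something from my actual code):

from pynvml import (
    nvmlInit,
    nvmlShutdown,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
)

def query_gpu_memory_on_worker():
    # Runs inside each Dask worker process and returns that worker's NVML view
    nvmlInit()
    try:
        readings = {}
        for i in range(nvmlDeviceGetCount()):
            info = nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(i))
            readings[f"gpu_{i}_used_mb"] = info.used / (1024 * 1024)
        return readings
    finally:
        nvmlShutdown()

# client.run executes the function in every worker process and returns a
# dict keyed by worker address
per_worker = client.run(query_gpu_memory_on_worker)
for worker_addr, readings in per_worker.items():
    logger.info("[worker %s] %s", worker_addr, readings)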

Hi @ap213, welcome to Dask Discourse forum!

First, I would like to be sure you are really launching computations on GPUs, as I don’t see any hint of that in your code. Are you configuring something, somewhere, to make sure the code runs on GPUs? From the code I see, you are creating standard Dask arrays, so they would be held in the server’s main memory and processed on CPUs; creating a LocalCUDACluster is not enough. But maybe you just didn’t include that part of the code.

To be more specific, you should use cupy, or use it as a backend, as in the XGBoost examples:

with Client(cluster) as client, dask.config.set({"array.backend": "cupy"}):
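
For instance, building on your from_array snippet, you could move the chunks to the GPU explicitly with cupy; a rough sketch, assuming cupy is installed and X_train / y_train are NumPy arrays as in your code:

import cupy as cp
import dask
import dask.array as da
import xgboost as xgb
from dask.distributed import Client

with Client(cluster) as client, dask.config.set({"array.backend": "cupy"}):
    # map_blocks(cp.asarray) converts each chunk to a CuPy array, so the
    # DaskDMatrix is built from device memory rather than host memory
    X_train_da = da.from_array(X_train, chunks=(chunk_size, -1)).map_blocks(cp.asarray)
    y_train_da = da.from_array(y_train, chunks=chunk_size).map_blocks(cp.asarray)
    dtrain = xgb.dask.DaskDMatrix(client, X_train_da, y_train_da)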

Next, or in the meantime, I would also check that the GPUs are actually being used with a system tool like nvidia-smi. If you see some usage there, you should be able to get it from Python.
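
If nvidia-smi does show activity, one simple way to get the same numbers into your logger.info output from Python is to call it directly; a quick sketch using standard nvidia-smi query options:

import subprocess

def log_nvidia_smi(logger, stage):
    # One line per GPU: "used, total" in MiB, the same numbers nvidia-smi displays
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    for idx, line in enumerate(out.strip().splitlines()):
        used, total = (int(v) for v in line.split(","))
        logger.info("[%s] GPU%d: %d MiB / %d MiB", stage, idx, used, total)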

You can also use the Dask dashboard, which has GPU support if dask-cuda is installed.

Sorry about that, I do have the training set up for multiple GPUs, I’m just not sure how much I am allowed to expose. When using watch -n1 nvidia-smi I am able to see slight fluctuations during the training, but pynvml keeps showing the same static values from start to end, so it’s not an issue of not updating the values:
2025-06-27 15:45:41,881 - src.utils.helpers - INFO - [training_iteration_290] GPU Memory Usage: GPU0: 74.2% | GPU1: 78.6% | GPU2: 73.5% | GPU3: 87.8%
2025-06-27 15:55:30,824 - src.utils.helpers - INFO - [training_iteration_1000] GPU Memory Usage: GPU0: 74.2% | GPU1: 78.6% | GPU2: 73.5% | GPU3: 87.8%

                train_callbacks = []
                if self._is_using_gpu():
                    gpu_callback = GPUMemoryCallback(logger, log_interval=10)
                    train_callbacks.append(gpu_callback)

                # Train using Dask distributed XGBoost
                logger.info("Starting distributed training...")
                output = xgb.dask.train(
                    client,
                    self.params,
                    dtrain,
                    # Other params
                    callbacks=train_callbacks if train_callbacks else None,
                )

I assume that it’s something to do with NVML caching the values when the workers call it, but I can’t seem to trace the issue, as the memory values do change when I start the training (all of them are around 1% beforehand) and at the end of training (again around 1% for each GPU).
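
If it is some kind of caching or stale state, I suppose re-initializing NVML around every reading would rule it out; a rough sketch of a wrapper around the helper from my first post (log_gpu_memory_usage_fresh is just a placeholder name):

from pynvml import nvmlInit, nvmlShutdown

def log_gpu_memory_usage_fresh(logger, stage):
    # Re-initialize NVML for every reading so no handle or cached state is reused
    nvmlInit()
    try:
        log_gpu_memory_usage(logger, stage)  # the helper from my first post
    finally:
        nvmlShutdown()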

How to efficiently monitor GPU usage without a dashboard? Would this be a good source of info, and is it applicable to my scenario of tracking memory usage with logger.info, or is there anything else that someone could point me to for this scenario?