Memory Not Freed after map_blocks

Hello,

I’m building a segmentation pipeline for CT images using dask.distributed to make use of multiple cores (20 cores with 2 threads each → 40 threads in total).

At some point I run out of GPU memory on my NVIDIA RTX A5000 (24 GB of VRAM).

What I’ve noticed is that when using dask_array.map_blocks() the GPU memory is not cleared after computation, or at least not by the amount I would expect.

I’ve created a minimal working example to reproduce this:

Minimal Working Example

import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask.distributed import LocalCluster

import cucim.skimage.restoration as cskr
import numpy as np

# to monitor things
cluster = LocalCluster(processes=True)
client = Client(cluster)
client

Now I create an array which mimics my data and a function which does the denoising. My final processThings function contains more and slower computations on each slice. As you can see, I use cucim and cupy to do the processing on my GPU. It’s much faster than on my CPU.


# create a 3D array filled with random numbers
array = da.random.randint(low=0, high=256, size=(2000,1000,1000), 
                          dtype=np.uint8, chunks=(1,1000,1000))

# function which should be executed on each slice
def processThings(block):
    # move the incoming NumPy block to the GPU
    block = cp.asarray(block)
    # Perform Chambolle denoising
    block = cskr.denoise_tv_chambolle(block, weight=0.9, eps=0.1, max_num_iter=50)
    # round and cast back to uint8 (still a cupy array on the GPU)
    return cp.around(block, 0).astype(cp.uint8)

I execute the function using dask_array.map_blocks() to use all of my threads; each thread pushes a slice from RAM to the GPU, where it is processed.

# Perform Chambolle denoising on each block
vol_Denoised = da.map_blocks(processThings, array)
# Compute the result and move it to numpy on RAM
vol_Denoised = vol_Denoised.compute().get()

I then compute the stack (the monitoring on the dashboard looks great) and use .get() to move the data from my GPU back into a numpy array (I have lots of RAM, so I can store multiple arrays there, and I want to send them back and forth between RAM and GPU for further processing).

Observations

Now what I observed is that even when I use get() to move the data from GPU to CPU, the VRAM usage does not change. Okay, but for that we can call some functions like these:

# free GPU memory
del array
mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()
mempool.free_all_blocks()
pinned_mempool.free_all_blocks()

Some of the memory gets released, but not all of it. I would expect the VRAM usage to go back to what it was before I processed the stack chunkwise.
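
To put some numbers on what actually gets released, one can look at the pool itself, e.g.:

# inspect the default memory pool (only of the process this runs in)
mempool = cp.get_default_memory_pool()
print("used bytes :", mempool.used_bytes())    # memory currently backing live cupy arrays
print("total bytes:", mempool.total_bytes())   # memory the pool holds on the device, including cached blocks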

Why?

It seems to be a Dask-related issue.

When I do it sequentially like this:

# doing it without dask
array = cp.random.randint(low=0, high=256, size=(2000,1000,1000), dtype=cp.uint8)

# Perform Chambolle denoising on each slice
for i in range(array.shape[0]):
    array[i] = processThings(array[i])

# Convert the result to a NumPy array
vol_Denoised_numpy = array.get()
    
del array

mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()
mempool.free_all_blocks()
pinned_mempool.free_all_blocks()

The VRAM is almost empty after free_all_blocks is called.

What am I not seeing here? And if Dask is the problem, what else can I do to process each slice on the GPU in parallel? (By that I mean that each slice gets processed individually, but multiple slices at once. A rough sketch of what I mean follows below.)
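
Just to sketch the behaviour I’m after (this is probably not the right way to do it, and run_parallel, worker and n_threads are only illustrative names, untested):

from concurrent.futures import ThreadPoolExecutor

import cupy as cp
import numpy as np

def run_parallel(vol, n_threads=20):
    # each thread takes one slice, runs processThings on the GPU and copies the result back
    def worker(i):
        return cp.asnumpy(processThings(vol[i]))

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return np.stack(list(pool.map(worker, range(vol.shape[0]))))

# volume would be my full uint8 CT stack living in RAM
# denoised = run_parallel(volume)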

Thank you in advance

Best

UTOBY

Hi @UTOBY,

Sorry, I’m not an expert in GPUs, so I’m not sure this will help a lot.

When trying to free the GPU memory in the Dask case, did you delete the reference to the result of vol_Denoised.compute() specifically?
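
Something like this is what I had in mind (just a sketch, I haven’t tried it with a GPU myself):

# keep a handle on the computed result (a cupy array)
result = vol_Denoised.compute()
vol_Denoised_numpy = result.get()   # copy the data back to host memory

# drop the GPU-side reference before freeing the pools
del result
cp.get_default_memory_pool().free_all_blocks()
cp.get_default_pinned_memory_pool().free_all_blocks()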

Also, did you try using LocalCudaCluster, to have an appropriate number of workers linked to your GPU? I’m not sure it will help since you only have one.
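
For reference, I was thinking of something along these lines (it needs the dask_cuda package):

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# one worker per visible GPU; with a single GPU that means a single worker
cluster = LocalCUDACluster()
client = Client(cluster)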

How much memory isn’t released?

Hey @guillaumeeb,

Question One

I’m not sure whether I’m misunderstanding you or the cupy documentation, but I thought that when using .get() on a cupy array, the data is copied into a numpy.ndarray and effectively removed from the GPU.

(Actually, if one creates a cupy array, uses the get() method and then runs

mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()
mempool.free_all_blocks()
pinned_mempool.free_all_blocks()

the GPU memory is freed.)

So it seems to me that at some point during map_blocks() a variable is created and allocated on the GPU (maybe cached somewhere), and after map_blocks is done doing its thing that memory is not marked as free.
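
One guess on my side: since the LocalCluster is started with processes=True, every worker process should have its own cupy memory pool, and the free_all_blocks() calls above only run in my main process. Maybe something like this would also clear the pools inside the worker processes (untested):

def free_worker_pools():
    import cupy as cp
    # release the cached blocks of this worker's default pools back to the device
    cp.get_default_memory_pool().free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()

# run the cleanup on every worker
client.run(free_worker_pools)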

I looked into caching a little more and found this:

# get the cache for device n
with cp.cuda.Device(0):
    cache = cp.fft.config.get_plan_cache()
    cache.set_size(0)  # disable the cache

but this doesn’t help in this case.

Question Two

I tried using dask_cuda.LocalCudaCluster, but since I only have one GPU this would be like executing a normal for loop, as I showed in my original post. I was not able to figure out how I can use my 20 cores to launch multiple calls to my processThings function at once.
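
What I had in mind was more something like this: a single worker process with many threads, so that several slices are pushed to the one GPU at a time while everything shares the same cupy pool (not sure this is the right way to set it up):

from dask.distributed import Client, LocalCluster

# one process, many threads, all sharing the single GPU
cluster = LocalCluster(processes=False, n_workers=1, threads_per_worker=20)
client = Client(cluster)

vol_Denoised = da.map_blocks(processThings, array).compute()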

Question Three

Of my available 24 GB of VRAM, about 5.5 GB aren’t released after map_blocks. This is roughly the size of the array in 32-bit, it seems.
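
To figure out in which process those 5.5 GB actually sit, I guess I could ask every worker for its pool usage, something like (untested):

def pool_usage():
    import cupy as cp
    mempool = cp.get_default_memory_pool()
    return {"used_bytes": mempool.used_bytes(), "total_bytes": mempool.total_bytes()}

# dict mapping worker address -> pool usage of that worker process
print(client.run(pool_usage))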

Investigation

I found related topics on SO.

If I change the code of my function like this:

def processThings(block):
    block = cp.asarray(block)
    # Perform Chambolle denoising
    block = cskr.denoise_tv_chambolle(block, weight=0.9, eps=0.1, max_num_iter=50)
    # copy the result back to host memory as a numpy array
    array_out = cp.asnumpy(block).astype(np.uint8)
    del block
    # free memory
    mempool = cp.get_default_memory_pool()
    pinned_mempool = cp.get_default_pinned_memory_pool()
    mempool.free_all_blocks()
    pinned_mempool.free_all_blocks()
    return array_out

Basically, by creating a new numpy array within the function and deleting the originally passed block, less memory (only ~2.2 GB) stays allocated on the GPU.

But this slows down the entire processing, because of the copy from VRAM to RAM (from cupy to numpy arrays).

It’s better than nothing, but I really expected that after the computation and the conversion from cupy to numpy the memory would be free again.

Since I want to do more calculations downstream on my GPU, at some point the GPU memory is bound to fill up and my segmentation stops with an OOM error.

I hope this helps to clarify things a bit on my end.

So long

UTOBY