Hello,
I’m building a segmentation pipeline for CT images that uses dask.distributed to spread the work across multiple cores (20 cores with 2 threads each → 40 threads in total).
At some point I run out of GPU memory on my NVIDIA RTX A5000 (24 GB of VRAM).
What I’ve noticed is that when I use dask_array.map_blocks(),
the GPU memory is not cleared after the computation, or at least not by the amount I would expect.
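To quantify this, here is roughly how the CuPy memory pool usage can be read (a sketch on my side; it assumes CuPy is installed and a CUDA device is visible — I also keep an eye on nvidia-smi):

```python
def gpu_pool_usage():
    # Assumes CuPy is importable and a CUDA device is present.
    import cupy as cp
    pool = cp.get_default_memory_pool()
    # used_bytes: currently allocated to live arrays;
    # total_bytes: everything the pool holds, including cached freed blocks.
    return pool.used_bytes(), pool.total_bytes()
```

Note that total_bytes() can stay high even after arrays are deleted, because the pool caches freed blocks until free_all_blocks() is called.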
I’ve created a minimal working example to reproduce this:
Minimal Working Example
import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask.distributed import LocalCluster
import cucim.skimage.restoration as cskr
import numpy as np
# to monitor things
cluster = LocalCluster(processes=True)
client = Client(cluster)
client
Now I create an array that mimics my data and a function that performs denoising. My actual processThings
function contains more (and slower) computations per slice. As you can see, I use cucim
and cupy
to do the processing on my GPU; it’s much faster than on my CPU.
# create a 3D array filled with random numbers
array = da.random.randint(low=0, high=256, size=(2000,1000,1000),
dtype=np.uint8, chunks=(1,1000,1000))
# function which should be executed on each slice
def processThings(block):
    block = cp.asarray(block)
    # Perform Chambolle denoising
    block = cskr.denoise_tv_chambolle(block, weight=0.9, eps=0.1, max_num_iter=50)
    return cp.around(block, 0).astype(cp.uint8)
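For scale, some back-of-the-envelope numbers (my assumption here: denoise_tv_chambolle works on a floating-point copy internally, so each uint8 slice temporarily costs more than its nominal size on the GPU):

```python
# One (1, 1000, 1000) chunk of uint8 is 1 MB on the GPU.
chunk_bytes_uint8 = 1 * 1000 * 1000 * 1
# If the denoiser promotes the slice to float64 internally, it costs 8x more.
chunk_bytes_float64 = chunk_bytes_uint8 * 8
# Even with 40 threads each holding one float64 slice at the same time,
# that is only a few hundred MB, far below the 24 GB of the A5000.
concurrent_bytes = 40 * chunk_bytes_float64
print(chunk_bytes_uint8 / 1e6, "MB per uint8 chunk")       # 1.0
print(concurrent_bytes / 1e6, "MB for 40 float64 slices")  # 320.0
```

So the per-slice working set alone should not come close to filling the card, which is part of why the memory growth surprises me.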
I execute the function using dask_array.map_blocks()
so that all of my threads push slices from RAM to the GPU, where they are processed.
# Perform Chambolle denoising on each block
vol_Denoised = da.map_blocks(processThings, array)
# Compute the result and move it to numpy on RAM
vol_Denoised = vol_Denoised.compute().get()
After completion I compute the stack (the monitoring on the dashboard looks great). I use .get()
to move the data from my GPU back into a NumPy array (I have lots of RAM, so I can store multiple arrays there, and I want to send them back and forth between RAM and GPU for further processing).
Observations
Now, what I observed is that even when I use .get()
to move the data from the GPU to the CPU, the VRAM usage does not change. Okay, but for that we can use some functions like:
# free GPU memory
del array
mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()
mempool.free_all_blocks()
pinned_mempool.free_all_blocks()
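A side note on the snippet above (an assumption on my part, not something I’ve verified): with processes=True, each dask worker is its own process with its own CuPy memory pool, so running free_all_blocks() in the client process would not touch the workers’ pools. A sketch of what a per-worker cleanup via client.run could look like:

```python
def free_gpu_pools():
    # Runs inside each worker process; assumes CuPy is importable there.
    import cupy as cp
    cp.get_default_memory_pool().free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()
    # Report how much the worker's pool still holds after freeing.
    return cp.get_default_memory_pool().total_bytes()

# On a live cluster this would broadcast the cleanup to every worker:
# client.run(free_gpu_pools)  # -> {worker_address: remaining_pool_bytes, ...}
```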
Some of the memory gets released, but not all of it. I would expect the VRAM to be as empty as it was before I processed the stack chunk-wise.
Why?
It seems to be a dask-related situation, because when I do it sequentially like this:
# doing it without dask
array = cp.random.randint(low=0, high=256, size=(2000,1000,1000), dtype=cp.uint8)
# Perform Chambolle denoising on each slice
for i in range(array.shape[0]):
    array[i] = processThings(array[i])
# Convert the result to a NumPy array
vol_Denoised_numpy = array.get()
del array
mempool = cp.get_default_memory_pool()
pinned_mempool = cp.get_default_pinned_memory_pool()
mempool.free_all_blocks()
pinned_mempool.free_all_blocks()
The VRAM is almost empty after free_all_blocks is called.
What am I not seeing here? And if dask is the problem, what else can I do to process the slices on the GPU in parallel? (By that I mean that each slice gets processed individually, but multiple slices at once.)
Thank you in advance
Best
UTOBY