Question about nthreads, workers, and chunks

Hi, I’m wondering what the purpose of nthreads is in Dask. As far as I know, each chunk is distributed to a worker for computation, on both CPU and GPU.
Consider this case:

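import cupy
import numpy as np
import dask.array as da
import dask.array.linalg as dal
rs = da.random.RandomState(RandomState=cupy.random.RandomState if device == "gpu" else np.random.RandomState)  # device is "gpu" or "cpu"
a = rs.random(size=(1000000, 1000), chunks=(10000, 1000)).persist()  # 100 chunks of 10000 x 1000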

When using UCX over InfiniBand on 2 nodes, each with 1 V100, this takes about 20s to finish with --nthreads 1 set for dask-worker, and about 12s with --nthreads 2.
When the computation runs on the GPU, what is the point of nthreads, given that the data is initialized on the GPU and I’m not moving the final result back to the CPU? Why does changing nthreads make a difference in the case above?

What is happening here is that when you say --nthreads 2, each worker is launched with two threads that can run computations and submit kernels to the GPU. As long as the resulting memory usage is not too great, this can work fine (in the same way that running multiple threads on the same CPU also works fine).

This will not always be faster, and may sometimes be slower, or lead to more out-of-memory errors than using just a single thread and one GPU per worker.
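
To make the trade-off concrete, here is a rough standard-library analogy, not Dask or GPU code. The helper `run_chunks`, the sleep-based fake task, and the `wait` value are all made up for illustration; only the chunk count of 100 comes from the array in the question (1000000 / 10000). Each fake task mostly waits, the way a worker thread might while a kernel or transfer is outstanding, so a second thread can overlap that wait. The same counter shows the cost: more chunks are in flight at once, which on a real GPU means more memory held at the same time.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_chunks(nthreads, n_chunks=100, wait=0.005):
    """Run n_chunks fake tasks on nthreads threads.

    Returns (elapsed_seconds, peak_tasks_in_flight).
    This is a toy model: time.sleep stands in for time a worker
    thread spends waiting on the device, not for real GPU work.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def task(_):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(wait)  # hypothetical per-chunk wait
        with lock:
            in_flight -= 1

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # Consume the iterator so every task has finished.
        list(pool.map(task, range(n_chunks)))
    return time.perf_counter() - start, peak


if __name__ == "__main__":
    for n in (1, 2):
        elapsed, peak = run_chunks(n)
        print(f"nthreads={n}: {elapsed:.2f}s, peak chunks in flight={peak}")
```

In this toy model the second thread helps because the tasks spend their time waiting. If the tasks already kept the device fully busy, or if two chunks in flight did not fit in GPU memory, the extra thread would not help and could hurt, which is the caveat above.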