GPU versus CPU: lack of performance

Hello,

I’m wondering why, in my example, using the GPU (5.7 s) is ~4 times slower than using the CPU (1.5 s):

import dask.array as da
import cupy as cp
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, LocalCluster
from datetime import datetime as dtt

N= 10
chunk_unit = 20

if __name__ == "__main__":

    # generate random numbers on cpu
    m = da.random.normal(size=(N*chunk_unit, 1024, 1204), chunks=(chunk_unit, 1024, 1024))

    # with cpu
    with LocalCluster() as cluster, Client(cluster) as client:
    
        # take average
        st = dtt.now()
        c = m.mean(axis=0).compute()
        print("finish CPU", dtt.now() - st, c[0,0], sep='\n')
        
    # with gpu
    with LocalCUDACluster() as cluster, Client(cluster) as client:
    
        # move the data to GPU (is this useful?)
        n = da.map_blocks(cp.asarray, m, dtype=float, meta=cp.array([]))

        # take average
        st = dtt.now()
        c = n.mean(axis=0).compute()
        print("finish GPU", dtt.now() - st, c[0,0], sep='\n')

It returns:

finish CPU
0:00:01.478294
0.0368461756129715

and

/opt/conda/lib/python3.12/site-packages/dask_cuda/utils.py:171: UserWarning: Cannot get CPU affinity for device with index 0, setting default affinity
  warnings.warn(
finish GPU
0:00:05.684561
0.0368461756129715

I’m working with WSL2 on Windows 11.
nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77.01              Driver Version: 566.36         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   32C    P8              3W /   30W |       0MiB /   8188MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

and
nvidia-smi topo -m

	    GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	X 		  		        N/A

What do you suggest?

Thanks
François

Hi @frenco, welcome to the Dask community!

I see several possible explanations here:

  • You are creating the Dask array with the NumPy backend, and then moving it to GPU with map_blocks. Dask is lazy, so this call actually does nothing until you call compute. You are therefore timing the creation of the array in host memory, its transfer to the GPU, and finally the mean operation. I tend to think the mean itself is negligible compared to the rest. Possible solutions: persist the array into GPU memory first, or use the CuPy backend right from the start (see the sketch after this list).
  • Again, your operation is really quick, and the data is small. Increasing the dataset size might make some difference (timing the right thing, of course).
  • Finally, we need to know what you are comparing: how many cores do your CPU and your LocalCluster use?
  • There is a good chance that NumPy or CuPy without Dask would be faster in this example.
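
To illustrate the first point, here is a rough, untested sketch (meant to go inside your LocalCUDACluster block, reusing your m array): either persist the GPU-backed array and time only the reduction, or generate the data directly on GPU through the CuPy backend.

import dask
import dask.array as da
import cupy as cp
from dask.distributed import wait

# Option 1: move to GPU and persist, so that only the reduction is timed
n = m.map_blocks(cp.asarray, dtype=float, meta=cp.array([]))
n = n.persist()   # materialize the CuPy-backed chunks on the workers
wait(n)           # block until the data actually sits in GPU memory
c = n.mean(axis=0).compute()

# Option 2: create the array with the CuPy backend right from the start
with dask.config.set({"array.backend": "cupy"}):
    n = da.random.normal(size=(200, 1024, 1024), chunks=(20, 1024, 1024))
c = n.mean(axis=0).compute()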

1. → Thanks so much for your answer. :+1:
The CuPy backend is indeed improving the results impressively.

import dask
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, LocalCluster
from datetime import datetime as dtt


N= 1000
chunk_unit = 20

if __name__ == "__main__":

    # with cpu
    with LocalCluster() as cluster, Client(cluster) as client:
    
        # generate random numbers on cpu
        m = da.random.normal(size=(N*chunk_unit, 1024, 1204), chunks=(chunk_unit, 1024, 1024))
        # take average
        st = dtt.now()
        c = m.mean(axis=0).compute()
        print("finish CPU", dtt.now() - st, c[0,0], sep='\n')
        
    # with gpu
    with LocalCUDACluster() as cluster, Client(cluster) as client:
    
        # generate random numbers directly on gpu via the cupy backend
        with dask.config.set({"array.backend": "cupy"}):
            n = da.random.normal(size=(N*chunk_unit, 1024, 1204), chunks=(chunk_unit, 1024, 1024))
        # take average
        st = dtt.now()
        c = n.mean(axis=0).compute()
        print("finish GPU", dtt.now() - st, c[0,0], sep='\n')

I ran the same code with N = 1000 in order to have more computation, and created the CuPy-backed dask.array under the desired array backend.

Here are the results:

finish CPU
0:02:31.694690
-0.0011302793257170585
/opt/conda/lib/python3.12/site-packages/dask_cuda/utils.py:171: UserWarning: Cannot get CPU affinity for device with index 0, setting default affinity
  warnings.warn(
finish GPU
0:00:41.165868
0.0018212489447026847

Now the GPU is more than 3 times quicker than the CPU.

I allocated 8 CPU cores (i5-13420H) to my WSL instance. Here is my .wslconfig:

[wsl2]
memory=12GB          
processors=8          
swap=4GB              

With N = 10, the speed is almost the same for both CPU and GPU (about 2 s), which is also a good thing.
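
In case it is useful, here is a quick sketch of how I can check the cluster size from the client (client.nthreads() is part of the standard dask.distributed Client API):

from dask.distributed import Client, LocalCluster

with LocalCluster() as cluster, Client(cluster) as client:
    # one entry per worker process, value = threads in that worker
    print(client.nthreads())
    # total number of threads the cluster is using
    print(sum(client.nthreads().values()))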

2. → Is the gain (x3) in line with what I should expect, considering the hardware?
3. → Setting the dataframe backend to ‘cupy’ does not seem to work. Do you have a similar tip for dataframes? (The configuration variant I attempted is sketched after the output below.)

Ex:

# -*- coding: utf-8 -*-


import dask
import dask.array as da
from dask import dataframe as ddf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, LocalCluster
from datetime import datetime as dtt
import pandas as pd
import cudf

N= 100
chunk_unit = 20

if __name__ == "__main__":
    df = pd.DataFrame({'NB':range(1024*100*chunk_unit)})
    with LocalCluster() as cluster, Client(cluster) as client:
        # generate ddf on cpu
        m = ddf.from_pandas(df, npartitions=1)
        m = ddf.concat([m]*10*N)
        # take average
        st = dtt.now()
        c = m.mean(axis=0).compute()
        print("finish CPU", dtt.now() - st, c, sep='\n')

    with LocalCUDACluster() as cluster, Client(cluster) as client:   
        # generate ddf on gpu
        m = ddf.from_pandas(df, npartitions=1).map_partitions(cudf.DataFrame)
        m = ddf.concat([m]*10*N)
        # take average
        st = dtt.now()
        c = m.mean(axis=0).compute()
        print("finish GPU CUDF", dtt.now() - st, c, sep='\n')    

It returns (warnings removed):

finish CPU
0:00:05.222073
NB    1023999.5
dtype: float64

finish GPU CUDF
0:00:16.863515
NB    1023999.5
dtype: float64
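
For clarity, this is the configuration variant I tried to mirror from the array case; the "cupy" value is my own guess for the dataframe backend and it does not seem to take effect here, which is why I fell back to the map_partitions(cudf.DataFrame) workaround above:

import dask
from dask import dataframe as ddf
import pandas as pd

df = pd.DataFrame({'NB': range(1024 * 100 * 20)})

# attempted backend switch, mirroring the array example
# ("cupy" is my guess for the value; it does not seem to work for me)
with dask.config.set({"dataframe.backend": "cupy"}):
    m = ddf.from_pandas(df, npartitions=1)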

Thanks again