Hi Dask team,
I have a quick question on the dask ecosystem that I am hoping will be an easy answer for one of you.
Is there an existing library or piece of the Dask ecosystem for distributing workloads that need multiple GPUs because of their memory requirements?
More specifically, the use case is both training and inference of neural networks whose layers are so large that they do not fit into a single GPU’s memory (or at least not a GPU that is easy to come by). I know there is dask-pytorch-ddp, but I’m not sure whether that is the right package or whether there is another, better piece of the Dask ecosystem I should be using or be aware of.
Thank you so much for your help.
I think what you are looking for is model parallelism, correct? I believe PyTorch and TensorFlow have their own cluster/distributed tools to accomplish this. I think Kubernetes-based tools like Kubeflow or Flyte are best for standing up those tools. I primarily use Dask for all the data preparation bits and have struggled to use Dask well for training or inference of TensorFlow models. The memory management and parallelism of these packages don’t seem to play too nicely with the Dask ecosystem.
That said, I have successfully “piggy-backed” a TensorFlow cluster on Dask, similar to the “Experiment with Dask and TensorFlow” post. In TensorFlow, setting up a tf cluster is as simple as setting the cluster environment variable (TF_CONFIG) on each node to reflect the distribution strategy you plan to use. I was able to train on a 32-GPU cluster, but Dask was only really used to run a function that set the env var on each node and to set up the cluster with the Dask Helm charts. In my particular case I was using a data-parallel distribution strategy.
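As a rough sketch of that piggy-backing step: TensorFlow reads its cluster layout from the TF_CONFIG environment variable, so the function run on each Dask worker only has to build and export that JSON. The helper names (`build_tf_config`, `set_tf_config`) and the port 2222 are assumptions made for illustration, not the code from the experiment mentioned above.

```python
import json
import os


def build_tf_config(worker_hosts, task_index, port=2222):
    """Return the TF_CONFIG JSON string for one node of a multi-worker cluster.

    worker_hosts: host names or IPs, in the same order on every node.
    task_index: position of the current node in worker_hosts.
    port: hypothetical port for the TensorFlow servers (an assumption).
    """
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must index into worker_hosts")
    config = {
        "cluster": {"worker": [f"{host}:{port}" for host in worker_hosts]},
        "task": {"type": "worker", "index": task_index},
    }
    return json.dumps(config)


def set_tf_config(worker_hosts, task_index, port=2222):
    """Export TF_CONFIG in the current process; meant to run on each node."""
    os.environ["TF_CONFIG"] = build_tf_config(worker_hosts, task_index, port)
    return os.environ["TF_CONFIG"]
```

With a `dask.distributed.Client`, one way to apply this would be to call `client.run(set_tf_config, hosts, i, workers=[address])` once per worker address, so that every node gets the same host list but its own task index.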
What @ljstrnadiii says definitely makes sense!
I don’t think there is any Dask-related library for doing PyTorch distributed learning except for dask-pytorch-ddp. The primary aim of this library is to use Dask to do data parallelism with PyTorch, i.e. the same model is loaded on every GPU and trained in parallel on different parts of the data.
According to PyTorch documentation:
If your model is too large to fit on a single GPU, you must use model parallel to split it across multiple GPUs. DistributedDataParallel works with model parallel; DataParallel does not at this time. When DDP is combined with model parallel, each DDP process would use model parallel, and all processes collectively would use data parallel.
So in theory, you could achieve both, but I don’t think this has ever been tested.
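To make the quoted layout concrete, here is a minimal planning sketch in plain Python. It assumes one DDP process per Dask worker, with the layers split into contiguous blocks across that worker's local GPUs. The function name `hybrid_layout` and the even block split are illustrative assumptions, not something dask-pytorch-ddp provides.

```python
def hybrid_layout(n_workers, gpus_per_worker, n_layers):
    """Plan DDP ranks and per-rank layer placement.

    Each worker is treated as one DDP process (rank). Inside a rank, the
    layers are split into contiguous blocks, one block per local GPU (model
    parallel). Across ranks, the whole model is replicated (data parallel).
    """
    if n_layers < gpus_per_worker:
        raise ValueError("need at least one layer per GPU")
    base, extra = divmod(n_layers, gpus_per_worker)
    placement = {}
    start = 0
    for gpu in range(gpus_per_worker):
        # The first `extra` GPUs take one additional layer each.
        size = base + (1 if gpu < extra else 0)
        placement[f"cuda:{gpu}"] = list(range(start, start + size))
        start += size
    return [
        {
            "rank": rank,
            "world_size": n_workers,
            "layers": {dev: list(ids) for dev, ids in placement.items()},
        }
        for rank in range(n_workers)
    ]
```

In an actual training function, each block of layers would be moved to its device, and the multi-device module would be wrapped in DistributedDataParallel without `device_ids`, since the PyTorch documentation requires `device_ids` to be left unset for multi-device modules.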
@guillaumeeb thanks so much for the reply. Yes, to put it more clearly, I’m looking to combine model parallelism and data parallelism using dask-pytorch-ddp, so that I’m running a large, multi-GPU model on different partitions of the data in parallel. Each Dask worker would have multiple GPUs.
I would hate to duplicate work on any of this, so if you hear of anyone on the team combining large models and distributed data with PyTorch, please let me know.
After doing a bit more research, and to be more specific: I’m looking for a way to use tensor parallelism (see “Tensor Parallelism - torch.distributed.tensor.parallel” in the PyTorch 2.1 documentation) along with dask-pytorch-ddp to distribute a “sharded” model in which the layers themselves are chunked across the multiple GPUs of the worker.
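For anyone unfamiliar with the idea, here is a conceptual sketch of what chunking a layer means, written in plain Python rather than with the torch.distributed.tensor.parallel API. The weight matrix of a linear layer is split into column blocks, each block stands in for one GPU's shard, and the per-shard outputs are concatenated. All function names here are hypothetical.

```python
def matmul(x, w):
    """Multiply a (batch x in) matrix by an (in x out) matrix, as nested lists."""
    return [
        [sum(xi * wij for xi, wij in zip(row, col)) for col in zip(*w)]
        for row in x
    ]


def split_columns(w, n_shards):
    """Split a weight matrix into n_shards column blocks (column-wise sharding)."""
    n_out = len(w[0])
    if n_out % n_shards:
        raise ValueError("output width must divide evenly across shards")
    step = n_out // n_shards
    return [[row[i:i + step] for row in w] for i in range(0, n_out, step)]


def column_parallel_linear(x, w, n_shards):
    """Compute x @ w with each shard producing a slice of the output columns."""
    # Each shard only ever holds its own block of the weights.
    partial_outputs = [matmul(x, shard) for shard in split_columns(w, n_shards)]
    # Concatenate the per-shard slices along the feature axis; on real
    # hardware this step corresponds to gathering results across GPUs.
    return [sum((part[b] for part in partial_outputs), []) for b in range(len(x))]
```

The sharded result matches the unsharded `matmul(x, w)`, which is the property that lets a single oversized layer be spread over several GPUs without changing what it computes.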