I have not seen any good examples demonstrating the use of dask arrays with tensorflow. My particular use case: I have a small-to-medium model that will train on a GPU worker, but I would like the CPU workers to be responsible for loading dask arrays into memory, applying simple ops, and sending the results to the single GPU machine. Training datasets are usually < 500 GB and I am expecting < 5 epochs.
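For concreteness, here is a minimal sketch of the kind of pipeline I have in mind. The zarr paths, chunking, and model are placeholders, and it assumes the training script itself runs on the GPU machine and acts as the dask client:

```python
import dask
import dask.array as da
import tensorflow as tf
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Lazily open the training data; paths and chunking are placeholders.
x = da.from_zarr("gs://my-bucket/train/x.zarr").rechunk((4_096, -1))
y = da.from_zarr("gs://my-bucket/train/y.zarr").rechunk((4_096,))

# Stand-in for the "simple ops" (in practice I'd persist the statistics).
x = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

def batches():
    """Compute one chunk at a time on the cluster and yield it to the trainer."""
    for xb, yb in zip(x.to_delayed().ravel(), y.to_delayed().ravel()):
        xb, yb = dask.compute(xb, yb)  # CPU workers do the work; results land here
        yield xb.astype("float32"), yb.astype("float32")

dataset = tf.data.Dataset.from_generator(
    batches,
    output_signature=(
        tf.TensorSpec(shape=(None, x.shape[1]), dtype=tf.float32),
        tf.TensorSpec(shape=(None,), dtype=tf.float32),
    ),
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # stand-in model
model.compile(optimizer="adam", loss="mse")
model.fit(dataset, epochs=5)
```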
Why?
- I would like the GPU to be fully utilized by offloading the data loading and preprocessing to non-GPU worker(s).
- I am able to fit the training op on a single GPU worker. Worst case, I select bigger GPUs and increase the number of GPUs per worker (limited to 8; I haven't seen GPU nodes with more than that). Basically, if a single GPU worker with 8 large GPUs receives data processed on the other workers and focuses primarily on model training, I should meet my training time requirements, provided the network can support the data transfer.
- If I need to scale to more than 8 GPUs, then I will likely look into TF's distribution strategies, the parameter server approach, etc.
Background:
- We are deploying with Helm on GKE
- No special networking options enabled
- I have tried to accomplish this with multiple worker groups and annotations, but I am failing to exceed a transfer rate of 0.33 gb/s. (I'll add my actual code in a bit; a rough sketch of the approach is below.)
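Roughly, the pinning I have been attempting looks like this. The resource labels, scheduler address, data, and train step are stand-ins for my actual setup:

```python
# CPU workers started with:  dask-worker tcp://scheduler:8786 --resources "CPU=1"
# GPU worker started with:   dask-worker tcp://scheduler:8786 --resources "GPU=1"
import dask
import dask.array as da
from dask.distributed import Client

# Low-level graph fusion can drop annotations, so disable it
# (my reading of the docs; happy to be corrected).
dask.config.set({"optimization.fuse.active": False})

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Keep loading/preprocessing on the CPU workers.
with dask.annotate(resources={"CPU": 1}):
    x = da.random.random((100_000, 1024), chunks=(4_096, 1024))  # stand-in load
    x = (x * 2).astype("float32")  # stand-in for "simple ops"
    x = x.persist()

# Materialize each chunk as a future living on the CPU workers.
chunk_futures = client.compute(list(x.to_delayed().ravel()))

def train_on_chunk(chunk):
    return float(chunk.sum())  # stand-in for a TensorFlow train step

# Pin the training tasks to the GPU worker; dask moves the chunks over.
train_futures = [
    client.submit(train_on_chunk, f, resources={"GPU": 1}, pure=False)
    for f in chunk_futures
]
print(sum(client.gather(train_futures)))
```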
Questions:
- Is this a reasonable task for dask?
- Is anyone doing this? I have a hard time believing I am the first to try this.
- Any rough guesses on the maximum data transfer rates from a group of workers to a single worker?
- Are there any recommended network optimizations to help speed up this data transfer?
- Is there any existing code that profiles data transfer between workers on separate nodes? (A rough sketch of what I have in mind is after this list.)
- Would bypassing dask’s communication with a ZMQ-based protocol like NVlabs/tensorcom (on GitHub) help speed up data transfer?
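For the last two questions, the crudest benchmark I can think of is timing a pinned transfer between two specific workers; the scheduler address, worker addresses, and payload size below are placeholders:

```python
import time
import numpy as np
from dask.distributed import Client, wait

client = Client("tcp://scheduler:8786")  # placeholder scheduler address
CPU_WORKER = "tcp://10.0.0.2:40001"      # placeholder addresses; see
GPU_WORKER = "tcp://10.0.0.3:40001"      # client.scheduler_info()["workers"]

# Park ~1 GB of data on a CPU worker.
payload = np.random.bytes(2**30)
[fut] = client.scatter([payload], workers=[CPU_WORKER])
wait(fut)

def touch(data):
    return len(data)  # forces the receiving worker to hold the full payload

# Run a task on the GPU worker that depends on data living on the CPU worker,
# so dask has to move it across the network.
start = time.time()
nbytes = client.submit(touch, fut, workers=[GPU_WORKER], pure=False).result()
elapsed = time.time() - start
print(f"{nbytes / 1e9:.2f} GB in {elapsed:.2f} s -> {nbytes / elapsed / 1e9:.2f} GB/s")
```

The timing includes (de)serialization overhead, so I'd treat it as a rough bound on what dask's comms see rather than raw network bandwidth.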