I’m new to Dask and Dask Gateway, but I’m trying to see how Dask can speed up a flood-mapping application. My data (in MATLAB .mat format) are divided into tiles; each tile needs to be mapped to a GeoTIFF file, and all the tile maps are mosaicked once the individual tile mapping is done. I can run the application as a notebook on the Microsoft Planetary Computer Hub, and now I’m trying to parallelize it using the Dask Gateway available on the PC. Below is the custom DAG. My data are stored in the home folder on the PC, and it seems that the gateway workers cannot access those tile files; I suspect they also cannot write the maps. So my question is: how can I make my local file system available to the gateway workers?
If you use Kubernetes as the dask-gateway-server backend type, you can mount your files into the workers by configuring the main dask-gateway YAML file.
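As a sketch of what that looks like with the dask-gateway Helm chart: the `extraPodConfig`/`extraContainerConfig` keys let you add arbitrary pod and container config to the workers. The PVC name and mount path below are made up for illustration, not the Planetary Computer’s actual configuration:

```yaml
# values.yaml fragment for the dask-gateway Helm chart (illustrative names)
gateway:
  backend:
    worker:
      extraPodConfig:
        volumes:
          - name: shared-data
            persistentVolumeClaim:
              claimName: my-shared-pvc    # a PVC that already exists in the cluster
      extraContainerConfig:
        volumeMounts:
          - name: shared-data
            mountPath: /shared            # workers then see the data under /shared
```

Note this only works if you (or your platform operator) control the Helm deployment; on a managed service like the Planetary Computer you cannot change this yourself.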
(This question is not specific to dask-gateway; many Dask deployments are possible where the workers cannot see the same filesystem as the client.)
It is perfectly possible to send data to workers (e.g., with client.scatter, or by including it in function-call arguments) and to collect data from workers (e.g., with client.gather). This is more or less what you are asking for, since there is no way for the workers to directly access the client’s filesystem. However, it is in general a bad idea unless the data is very small: you end up losing any benefit from parallelism in the cost of the transfer.
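As a minimal sketch of that pattern, using an in-process local cluster as a stand-in for a Gateway cluster (the array and the summing function are made up for illustration):

```python
import numpy as np
from dask.distributed import Client

# In-process client as a stand-in for a dask-gateway cluster
client = Client(processes=False)

tile = np.ones((1000, 1000))           # pretend this is one tile loaded on the client

# Push the data to the workers once, then pass the resulting future to tasks
future_tile = client.scatter(tile)
result_future = client.submit(lambda t: t.sum(), future_tile)

# Pull the result back to the client
result = client.gather(result_future)
print(result)                          # 1000000.0
client.close()
```

Note that the full tile crosses the network twice here (scatter out, gather back), which is exactly the transfer cost described above.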
You should find out whether the HPC system has shared storage that you can use. For example, it is very common to have NFS available for at least temporary storage. Alternatively, you may be able to write to an external storage service such as a cloud object store, if the network security allows it. Since moving data in and out of an HPC cluster is something people generally need to do, there will be some sort of policy you can find.
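The object-store route fits this workflow well because each worker can write its finished tile map directly to remote storage, with no shared local disk. A sketch using fsspec, whose API is uniform across backends; the in-memory filesystem here stands in for a real remote store such as `abfs://` (Azure Blob, via adlfs) or `s3://` (via s3fs), and the paths are made up:

```python
import fsspec

# "memory" stands in for a real remote protocol like "abfs" or "s3";
# the code is the same apart from the protocol and credentials.
fs = fsspec.filesystem("memory")

# Each worker could write its finished tile map like this, needing only
# network access to the object store.
with fs.open("memory://maps/tile_001.tif", "wb") as f:
    f.write(b"...geotiff bytes...")

# The client (or the mosaicking step) can then list and read the outputs
print(fs.ls("/maps", detail=False))
```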
Thanks for the responses. I’m using Microsoft’s Planetary Computer (PC), which is basically a JupyterHub plus a Dask Gateway (or a Pangeo on Azure). I think Martin is right that I should not use client.scatter and client.gather to share client files with workers (some of the tile files are very large).
I’m not sure how the PC gateway/cluster is set up and whether there is shared storage between the client and the workers. I have posted the question on their discussion channel and am waiting for a response. I’m also wondering whether there is an API method that allows a client to “mount” a directory to the gateway so that the workers can access the client files under that directory. Thanks for the help!
No, this would either amount to essentially the same thing as scatter/gather, or (if lower-level) to some sort of minimalist NFS. We do not support this, as it would not be worthwhile compared to real cluster networking solutions.
The opposite can be done: viewing the filesystem of a worker as just another fsspec implementation. The aim is not to move data around, but to get file listings.
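For reference, fsspec registers this under the "dask" protocol (backed by its DaskWorkerFileSystem); the usage lines below are a sketch assuming a running cluster, with the example path made up:

```python
import fsspec

# fsspec's "dask" protocol proxies filesystem calls (ls, info, open)
# through a Dask client to a worker's local filesystem.
cls = fsspec.get_filesystem_class("dask")
print(cls.__name__)

# With a connected client one would then do, e.g.:
# fs = fsspec.filesystem("dask", target_protocol="file")
# fs.ls("/home/jovyan")   # listing of a worker's filesystem, not the client's
```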