I read this page, and I still have (sub) questions:
In my use case, most of the custom data-analysis code lives in its own package. The single-process workflow is essentially `git clone library`, `cd library`, `emacs library`, `python -m library.main`, and I am wondering how to adapt that process to Dask. My current idea is to move the packages (or maybe just the one package I am actively editing) onto NFS, mount that share on the workers, and put it on their `$PYTHONPATH`.
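Concretely, the adapted workflow I am imagining looks roughly like this (everything here is a placeholder: the `/nfs/code` path, the scheduler address, and the `library.main.analyze` function stand in for my actual setup):

```python
# Workers are assumed to be started with the NFS checkout on their import
# path, e.g.:
#   PYTHONPATH=/nfs/code dask-worker tcp://scheduler:8786
# (path and scheduler address are placeholders).
from dask import delayed
from dask.distributed import Client

# Hypothetical entry point inside my package, checked out at /nfs/code/library.
from library.main import analyze

client = Client("tcp://scheduler:8786")

# Build a small graph from the package's code and run it on the workers.
tasks = [delayed(analyze)(i) for i in range(10)]
futures = client.compute(tasks)
print(client.gather(futures))
```

With that setup in mind: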
- Will the workers see the updated package when the client process is restarted? I.e., if I edit a package called `x` and run a script that starts a Dask client, imports a delayed function from `x`, and runs it, will the workers use the updated copy of `x`? (The sketch after this list shows the kind of check I have in mind.)
- Are the consistency guarantees of NFS strong enough that, if a developer updates the source code, the change shows up on all of the workers?
- How do I tell `dask-worker` to pass `-X pycache_prefix=/path/outside/nfs` to Python (and is this even necessary)?
- Will using NFS to store Python package source code (not data) cause a performance hit?
- This could all be avoided if it were possible to tell Dask "treat module `x` as unimportable, like `__main__`, and serialize it whenever it is referenced". Yes, that would be slow, but this package changes constantly, so 80% of the time it would have to be shipped just before runtime anyway. Is there such an option?
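For the first and third bullets, this is the kind of check and brute-force workaround I could do today if there is no better answer; it only uses generic `Client` methods, and the package name and scheduler address are placeholders:

```python
# Check which copy of the package each worker imported, and restart workers
# to force a fresh import after editing the source on NFS.
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address

def where_is_library():
    import library  # my package, importable from the NFS checkout
    return library.__file__

# One entry per worker: the file each worker loaded the module from.
print(client.run(where_is_library))

# Heavy-handed workaround for stale imports: restart the worker processes so
# the next task re-imports the package from the (now updated) checkout.
client.restart()
```

For the `__pycache__` question specifically, my understanding is that exporting `PYTHONPYCACHEPREFIX=/path/outside/nfs` in the workers' environment before launching `dask-worker` is equivalent to passing `-X pycache_prefix`, so no extra `dask-worker` flag should be needed.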
The containerization/packaging methods seem to require redeploying the cluster (i.e., `docker stop dask-worker`, `docker rm dask-worker`, `docker run dask-worker:new-image` on every node), which makes sense when the package set changes rarely, but that is not my use case. `Client.upload_file` and `distributed.diagnostics.plugin.PipInstall`, on the other hand, seem too ad hoc, but maybe I could put a convincing wrapper around them that hides everything (see the sketch below)?
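By "a convincing wrapper" I mean something along these lines: zip up the working copy and push it to the workers at the start of every run. `LIBRARY_DIR`, `push_package`, and the scheduler address are names I am making up, and I am assuming `upload_file` puts an uploaded `.zip` on the workers' import path the way its documentation describes:

```python
# Sketch of a wrapper around Client.upload_file: zip the current working copy
# of the package and ship it to every worker before running anything.
import shutil
import tempfile
from pathlib import Path

from dask.distributed import Client

LIBRARY_DIR = Path("/home/me/src/library")  # placeholder: my git checkout

def push_package(client: Client, package_dir: Path) -> None:
    """Zip a package directory and upload it to all workers."""
    with tempfile.TemporaryDirectory() as tmp:
        # Produces <tmp>/library.zip with the package directory at its root.
        archive = shutil.make_archive(
            str(Path(tmp) / package_dir.name),
            "zip",
            root_dir=package_dir.parent,
            base_dir=package_dir.name,
        )
        client.upload_file(archive)

client = Client("tcp://scheduler:8786")  # placeholder address
push_package(client, LIBRARY_DIR)
```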
Any answer, even to a subset of these questions, is much appreciated. If there is a consensus that explaining how to run Dask clusters over NFS is worthwhile, I will add a section on it to this page.