I read this page, and I still have (sub) questions:
In my use case, most of the custom data-analysis code resides in its own package. The primary single-processor workflow is `git clone library`, `cd library`, `emacs library`, `python -m library.main`. I am wondering how to adapt this process to Dask. I was thinking of moving all of the packages (or maybe just the one package I am interested in editing) to NFS, replicating that mount on the workers, and putting it on their `$PYTHONPATH`, roughly as sketched below.
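To make that concrete, here is roughly what I picture on the client side, assuming the workers were started with the NFS mount on their `PYTHONPATH` (something like `PYTHONPATH=/nfs/code dask-worker tcp://scheduler:8786` on each node). The paths, the scheduler address, and `library.main.analyze` are all placeholders, and I do not know yet whether this behaves the way I hope:

```python
# Sketch only: the client's view of the NFS-based workflow described above.
from dask import delayed
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

import library.main  # resolved from the NFS mount on the client

# Sanity check: ask every worker where it resolves the package from,
# to confirm they are all importing the NFS copy.
def locate_library():
    import library
    return library.__file__

print(client.run(locate_library))

# Run the actual analysis; library.main.analyze is a made-up entry point.
result = delayed(library.main.analyze)("input.csv").compute()
```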
- Will the workers see the updated package when the client process is restarted? I.e., if I edit a package called `x`, then run a script that starts a Dask client, imports a delayed function from `x`, and runs it, will the workers use the updated copy of `x`?
- Are the consistency guarantees of NFS strong enough that, if a developer updates the source code, the change is visible on all of the workers?
- How do I tell `dask-worker` to pass `-X pycache_prefix=/path/outside/nfs` to Python (and is this even necessary)?
- Will using NFS to store Python package source code (not data) cause a performance hit?
- This could all be avoided if it were possible to tell Dask to "treat module `x` as unimportable, like `__main__`, and serialize it whenever it is referenced". Yes, it would take a long time, but this is a package that is constantly changing, so about 80% of the time the package would have to be sent just before runtime anyway. Is there such an option? (A rough sketch of what I mean is after this list.)
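To expand on that last bullet: cloudpickle (which Dask uses for serialization) has a `register_pickle_by_value` function, and something along these lines is what I am imagining. I am not sure whether this is the supported way to get that behaviour with Dask, or whether the module would be re-sent with every task versus cached per worker, so treat this as a guess; `x.analyze` and the scheduler address are placeholders:

```python
# Guess at how "ship module x by value instead of importing it on the workers"
# might look, using cloudpickle's register_pickle_by_value (cloudpickle >= 2.0).
import cloudpickle
from dask import delayed
from dask.distributed import Client

import x  # the constantly-changing local package

# Ask cloudpickle to serialize x by value (like __main__) rather than by reference.
cloudpickle.register_pickle_by_value(x)

client = Client("tcp://scheduler:8786")             # placeholder scheduler address
result = delayed(x.analyze)("input.csv").compute()  # x.analyze is a placeholder
```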
The containerization/packaging methods seem to require redeploying the cluster (i.e., `docker stop dask-worker`, `docker rm dask-worker`, `docker run dask-worker:new-image` on every node), which makes sense if updating the package set is rare (not my use case). Meanwhile, `client.upload_file` and `distributed.diagnostics.plugin.PipInstall` seem too ad hoc, but maybe I could put a convincing wrapper around them that hides everything (something like the sketch below)?
Any answer, even to a subset of the questions, is much appreciated. If there is a consensus that explaining the operation of NFS-backed clusters is worthwhile, I will add that explanation to this page.