Package upload - best practice

giucesar · April 20, 2022, 10:25pm

Hi.
I am developing a project with a large package tree. It works as expected both when I run it locally and when I submit it to dask-scheduler to distribute across my cluster.
I have a script that “compiles” the Python code into an egg file (remove old egg && python setup.py bdist_egg) every time I run, so the most recent code gets built and uploaded through the client.

At the beginning of the code, I run:

    from dask.distributed import Client
    from distributed.diagnostics.plugin import UploadFile

    # config is defined elsewhere in the project
    client = Client(address=config["dask_server"], name="project-client")
    filename = config["dist_file"]  # path to the freshly built egg
    client.register_worker_plugin(UploadFile(filename), name="egg-package")
    client.unregister_worker_plugin(name="egg-package")
    client.restart()
    # registering twice while I can't figure out a better way
    client.register_worker_plugin(UploadFile(filename), name="egg-package")

The workers are already running on the cluster through the dask-worker command.

This process works, but it is annoying.
If the egg file changes, the worker says “bad file header,” and I have to kill the workers, delete the temp dir and start the workers again. Then, the same egg file it complained about works.

Is there a better way to handle this process? If the package were finished, I could install it on my worker nodes, but the egg will change frequently.

I tried to find a way to clean the worker temp directory when the client connects, but could not find anything in the forums or the documentation. Also, if I retire the workers, the dask-worker process dies on the node machines, and I need to connect manually.
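For reference, one way the cleanup could be attempted is with client.run, which passes the worker object to any function that has a parameter named dask_worker. This is only a sketch: the helper name remove_stale_eggs is made up for illustration, it assumes uploaded files land in the worker's local_directory, and it is not confirmed here that it makes the “bad file header” error go away.

```python
import os


def remove_stale_eggs(dask_worker):
    """Delete *.egg files from this worker's local directory; return their names."""
    removed = []
    local_dir = dask_worker.local_directory
    for entry in os.listdir(local_dir):
        if entry.endswith(".egg"):
            os.remove(os.path.join(local_dir, entry))
            removed.append(entry)
    return sorted(removed)


# Hypothetical usage, right after connecting and before uploading the new egg:
#     removed_by_worker = client.run(remove_stale_eggs)
#     client.restart()  # assumption: workers run under a nanny, so they come back fresh
```

Deleting the file does not unload anything a worker process has already imported, which is why the usage comment follows the cleanup with a restart.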

I feel I am missing something silly and getting bogged down.

bryanweber · May 2, 2022, 2:25pm

Hi! Sorry for the delay in replying. This is a somewhat complicated situation, so what I’m suggesting may not work for you. That said, the problem is that Python needs to re-install the egg when it’s updated. My suggestion has two parts:

1. Run pip install -e (an editable install) on the workers.
2. Update the code in the path where pip installed from.

The first step shouldn’t be too hard, since you’re already doing it. The second step is a little trickier… One option is to git push to a (private?) repo and then either set up a webhook or have the workers poll for updates and git pull the code.
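As a rough sketch of the second step, the client could trigger the git pull itself instead of relying on a webhook or polling. The function names and the repo_dir argument below are hypothetical, and this assumes every worker has a git checkout at the same path that was installed with pip install -e.

```python
import subprocess


def git_pull_command(repo_dir):
    """Build the git command that fast-forwards the checkout in repo_dir."""
    return ["git", "-C", repo_dir, "pull", "--ff-only"]


def pull_latest(repo_dir):
    """Run on a worker: update the checkout and return git's output."""
    result = subprocess.run(
        git_pull_command(repo_dir), capture_output=True, text=True, check=True
    )
    return result.stdout


def update_workers(client, repo_dir):
    """Pull on every worker, then restart so the new code gets imported."""
    outputs = client.run(pull_latest, repo_dir)
    # Already-imported modules are not reloaded automatically, hence the restart.
    client.restart()
    return outputs
```

Note that client.run calls the function once per worker process, so if several workers on one machine share a checkout, the pulls may collide; in that case it is probably safer to run the pull once per host by some other means.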

I don’t know if this is a good idea or not, but it should resolve the egg errors you’re getting.

giucesar · May 2, 2022, 3:17pm

Humm…
I could rsync to all the workers and ensure they have the same code at run time. I will give it a try later this week.
Thanks.
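A minimal sketch of the rsync idea, run from the machine that holds the latest code. The host names and paths are placeholders, and it assumes passwordless SSH to each node and that the destination is the directory the workers import from.

```python
import subprocess


def rsync_command(source_dir, host, dest_dir):
    """Build an rsync command that mirrors source_dir onto host:dest_dir."""
    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    source = source_dir.rstrip("/") + "/"
    return ["rsync", "-az", "--delete", source, f"{host}:{dest_dir}"]


def sync_to_workers(source_dir, hosts, dest_dir):
    """Run rsync once per host; raises CalledProcessError if a sync fails."""
    for host in hosts:
        subprocess.run(rsync_command(source_dir, host, dest_dir), check=True)


# Hypothetical usage before creating the Client:
#     sync_to_workers("src", ["node1", "node2"], "/opt/project")
```

The --delete flag removes files on the worker that no longer exist locally, so the destination should be a directory dedicated to this code.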
