SSH Cluster - Proper environment configuration on Worker side

Consider the below in main.py:

from dask.distributed import Client, SSHCluster

cluster = SSHCluster(["localhost", "localhost"],
                     connect_options={"known_hosts": None},
                     worker_options={"n_workers": 6},
                     scheduler_options={"port": 0, "dashboard_address": ":8797"})

client = Client(cluster)

and another module test_dask.py, which prepares a params array with different parameters for each worker, and passes them to dask.delayed as below (worker_method exists in the same module):

delayed_results = [dask.delayed(worker_method)(dask_params[i]) for i in range(6)]

computed_results = dask.compute(*delayed_results)
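For context, the relevant parts of test_dask.py look roughly like this (a minimal sketch only; the body of worker_method and the contents of dask_params are hypothetical placeholders, not my actual code):

import dask

def worker_method(params):
    # Placeholder body: the real method runs one simulation with the given parameters.
    return params["worker_id"] ** 2

# One parameter dict per worker; the contents are illustrative only.
dask_params = [{"worker_id": i} for i in range(6)]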

The code above fails with the below error:

No module named test_dask.py

This seems to be a well-reported issue. However, the only thing that worked for me was calling the below for every module required by test_dask.py, as well as for test_dask.py itself, right after instantiating the “client” object.

client.upload_file('simulator/test_dask.py')
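Concretely, my current workaround looks something like this (a sketch; the file list just mirrors my project layout, and I believe upload_file can also take a packaged .zip or .egg instead of individual .py files):

# Current workaround (sketch): push every module that test_dask.py needs
# to the scheduler and all workers, one file at a time.
for module_path in [
    "simulator/test_dask.py",
    "simulator/other_modules.py",
]:
    client.upload_file(module_path)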

Uploading all the required modules to every worker that is started on a remote node seems like overkill. I would rather make sure that the modules exist on the remote nodes and somehow instruct the worker nodes to point to them (and to the same environment in general).

Furthermore, as one can see above, I am not even using a remote node yet (both scheduler and workers are on localhost). I have created a virtual environment in the root of my project. Consider the below structure:

mtdcabm

  • bin
  • lib
  • lib64
  • simulator
    • main.py
    • test_dask.py
    • other_modules.py

The interpreter being used on the client / scheduler is bin/Python3.11.

How can I dynamically instruct the workers to point to the specific Python interpreter and virtual environment that I want and not have to rely on the upload_file method to upload specific modules remotely?

I am using Python 3.11.2 on Ubuntu 22.04.3.

This is always a pain point with a distributed setup. Did you try to use the remote_python kwarg from dask-ssh?

I’m not sure there is an easy way to set PYTHONPATH with dask-ssh…
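For the remote_python idea, I would expect the call to look roughly like this (an untested sketch; the absolute path to the venv’s interpreter is a placeholder you would need to adapt):

from dask.distributed import Client, SSHCluster

# Untested sketch: remote_python tells dask-ssh which interpreter to launch
# on each host. The path below is a placeholder for your venv's interpreter.
venv_python = "/path/to/mtdcabm/bin/python3.11"

cluster = SSHCluster(
    ["localhost", "localhost"],
    connect_options={"known_hosts": None},
    remote_python=venv_python,
    worker_options={"n_workers": 6},
    scheduler_options={"port": 0, "dashboard_address": ":8797"},
)
client = Client(cluster)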

Yes, I tried to use the remote_python kwarg. It does not make a difference. I was passing the path to bin/Python3.11 (the bin folder exists in the virtual environment). Is this correct, or do I have to pass the path to the virtual environment (in my case the root of the project)? The “upload_file” method works; however, it is somewhat slow (and likely to become even slower with an increasing number of nodes). If I loaded a static environment on each remote node, it would likely be much faster. Do you have any other ideas to try out?

@guillaumeeb I did some more digging into this: I logged the Python interpreter and current working directory for the workers, and without specifying anything for the interpreter, it seems to have defaulted to the same interpreter within the virtual environment of the client. I am not sure why or how. The current working directory, however, is home/jurgen rather than the path to the virtual environment.
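I checked it roughly like this (a sketch of the probe I ran; report_worker_env is just a helper name I made up here):

import os
import sys

def report_worker_env():
    # Executed on every worker via client.run; returns the interpreter
    # path and the current working directory of that worker process.
    return {"executable": sys.executable, "cwd": os.getcwd()}

print(client.run(report_worker_env))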

I tried to set the correct path as follows:

worker_options={"n_workers": params["numworkers"], "local_directory": config["worker_working_directory"], },

But this had no effect. It doesn’t crash but it doesn’t work either. “home/jurgen” was still returned by the workers as current working directory via os.getcwd().

I tried the same approach with scheduler_options; however, it also doesn’t work (the scheduler is in fact complaining about not having access to the modules).

How can I check the interpreter and current working directory for the scheduler? And how do I set these properties dynamically? scheduler_options doesn’t seem to feature “local_directory” at all, or in any case, it does not have any effect whatsoever.

Well, this is what I would expect.

Just a question, if the Python environment is correct, why do you need to have a specific directory for Workers, and moreover for the Scheduler?

There is nothing wrong with the scheduler and workers using the same interpreter; actually, that is much better. What I am concerned with is loading the required modules on the scheduler without having to explicitly upload them all, one by one, via the client.upload_file() method (which is what I am currently having to do). If I were able to set the working directory to the “virtual environment” where the modules reside, I assume I would not need client.upload_file, no? Or have I got it all wrong?

Maybe, but I’m not even sure this would be enough; I’m not sure what the Worker start script takes as PYTHONPATH by default.

My idea was that your local files were also installed/deployed in the current virtual environment. Could you do that? Try to add some pip machinery or similar to install the files in the env? This way you’d just have to do a pip install . before trying your code.
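For instance, a minimal setup.py at the project root might be enough (just a sketch; the package name and version are placeholders, and it assumes simulator/ contains an __init__.py; a pyproject.toml would work the same way):

# setup.py (sketch): make the simulator modules pip-installable into the venv,
# so every node that uses this venv can import them without upload_file.
from setuptools import setup

setup(
    name="mtdcabm-simulator",  # placeholder name
    version="0.1.0",           # placeholder version
    packages=["simulator"],    # assumes simulator/__init__.py exists
)

Then running pip install . (or pip install -e . during development) with the venv’s pip from the project root would make the modules importable by the scheduler and workers, as long as they all start from that same environment. Note that with this layout the imports would become simulator.test_dask rather than test_dask.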

Another solution I’m thinking of is to look at the connect_options kwarg: I can see in API Documentation — AsyncSSH 2.14.1 documentation that you should be able to provide environment variables. Not sure it’ll work though.
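If it does work, I would imagine it looking something like this (completely untested; I think env is the asyncssh option the docs mention for forwarding environment variables, the PYTHONPATH value is a placeholder, and the remote sshd has to be configured to accept the variable, e.g. via AcceptEnv, or it may be silently dropped):

from dask.distributed import SSHCluster

# Untested sketch: try to forward PYTHONPATH over the SSH connection so the
# launched scheduler/worker processes can find your modules.
cluster = SSHCluster(
    ["localhost", "localhost"],
    connect_options={
        "known_hosts": None,
        "env": {"PYTHONPATH": "/path/to/mtdcabm/simulator"},  # placeholder path
    },
    worker_options={"n_workers": 6},
    scheduler_options={"port": 0, "dashboard_address": ":8797"},
)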

A final approach could be to use the .bashrc file to set some environment variables upon SSH connection.