Different path-string on client and worker/scheduler

spheredb · April 26, 2023, 6:03pm

Hi,

I have a large csv in a shared folder and I want to work on it in a partitioned way. To read the csv I’m using dask.dataframe.read_csv(path, blocksize=…) and then I’m just running ddf.map_partitions(f).compute().

The problem is that my client is on Windows and my workers are on Linux, so they each have a different path string for the shared file (which is accessible from both).
e.g.
worker path: /mnt/share/file.csv
client path: \\server\share\file.csv

Is there any way to solve this? Did some research but no luck until now.

Thanks in advance!
Regards

guillaumeeb · April 28, 2023, 5:55am

Hi @spheredb, welcome to the Dask community.

There are two issues with your setup:

1. The different paths on Client and Workers: the Dask Client needs access to the file metadata in order to compute the partitions and distribute the tasks, and the Workers need access to the file itself to read the blocks. read_csv has no way to take one path for the Client and another for the Workers.
2. Client and Workers are on different operating systems (and probably in different Python environments): this can also lead to tricky mistakes.

I would recommend avoiding this situation and running everything on Linux. Even with different paths, you could probably try to create a symlink.

So the easy way would be to have the same path on both Client and Workers, using a symbolic link or maybe by changing the mount path, but I guess you cannot do this?

Maybe you could also try an environment variable that you set on both the Client side and the Worker side, but with a different value on each, and build the path from that variable.
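A minimal sketch of the environment variable idea, assuming a hypothetical variable named SHARE_ROOT (not something Dask defines) that is set to the share's mount point on each machine. The lookup has to happen inside the function that runs on the machine in question: a path string already built on the Client is shipped to the Workers as-is.

```python
import os


def resolve_shared(relative_name, var="SHARE_ROOT"):
    """Build the local path to a shared file.

    SHARE_ROOT is a hypothetical variable name. It is read at call time,
    so each machine gets its own mount point (for example /mnt/share on
    a Linux worker and the UNC share root on a Windows client).
    """
    root = os.environ[var]
    return os.path.join(root, relative_name)
```

Calling resolve_shared("file.csv") inside a task that runs on a Worker would then give the Worker's path. On its own this does not change what dask.dataframe.read_csv does with a path passed from the Client.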

I don’t have any other simple suggestion. A more complex one would be to implement the data reading yourself: analyze the file on the Client side, then generate custom tasks that read it on the Workers with a different path…
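The custom reading approach could be sketched roughly as follows. This is an illustration under stated assumptions, not how dask.dataframe.read_csv is implemented: the helper names (block_ranges, read_block, build_dataframe) are made up for this example, the file is assumed to have a single header line, and no quoted field may contain a newline. The Client only looks at the file size to cut byte ranges; each task then opens the file through the Worker path and keeps the rows that start inside its range.

```python
import io
import os


def block_ranges(client_path, blocksize):
    """On the client: split the file into (offset, length) byte ranges."""
    size = os.path.getsize(client_path)
    return [(start, min(blocksize, size - start))
            for start in range(0, size, blocksize)]


def read_block(worker_path, offset, length):
    """On a worker: return the header line and the rows starting in the range.

    A row belongs to the block that contains its first byte, so every row
    is read exactly once. Assumes no newlines inside quoted fields.
    """
    end = offset + length
    with open(worker_path, "rb") as f:
        header = f.readline()
        if offset > 0:
            # Step back one byte and finish the line we land in: if that
            # byte is a newline, the next row starts exactly at `offset`.
            f.seek(offset - 1)
            f.readline()
        lines = []
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            lines.append(line)
    return header, lines


def build_dataframe(client_path, worker_path, blocksize):
    """Build a Dask DataFrame whose tasks read through `worker_path`."""
    # Imported here so the helpers above only need the standard library.
    import dask
    import dask.dataframe as dd
    import pandas as pd

    def load(offset, length):
        header, lines = read_block(worker_path, offset, length)
        return pd.read_csv(io.BytesIO(header + b"".join(lines)))

    parts = [dask.delayed(load)(offset, length)
             for offset, length in block_ranges(client_path, blocksize)]
    # Without `meta`, from_delayed computes one partition to infer dtypes.
    return dd.from_delayed(parts)
```

With the paths from the question, build_dataframe would be called with the Windows path as client_path and /mnt/share/file.csv as worker_path, followed by the same map_partitions(f).compute(). Since dtypes are inferred per block here, passing explicit dtypes to pd.read_csv is probably safer for real data.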
