I have a large CSV file in a shared folder and I want to process it in a partitioned way. To read it I'm using dask.dataframe.read_csv(path, blocksize=…) and then just running ddf.map_partitions(f).compute().
The problem is that my client is on Windows and my workers are on Linux, so they use different path strings for the shared file (which is accessible from both).
worker path: /mnt/share/file.csv
client path: \\server\share\file.csv
Is there any way to solve this? I did some research but no luck so far.
Thanks in advance!
Hi @spheredb, welcome to the Dask community.
There are two issues with your setup:
- The different paths on Client and Workers: the Dask Client needs access to the file metadata in order to compute the partitions and distribute the tasks, while the Workers need access to the file to read the blocks. There is no way to give different paths to the Client and the Workers.
- Client and Workers are on different OSes (and probably different Python environments): this can also lead to tricky mistakes.
I would recommend avoiding this situation and running everything on Linux. Even with different paths, you could probably create a symlink.
So the easy way would be to have the same path on both Client and Workers, using a symbolic link or maybe changing the mount path, but I guess you cannot do this?
Maybe you could also try an environment variable that you set on both the Client side and the Worker side, but with a different value on each, and build the path from that variable.
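A minimal sketch of that idea (DATA_ROOT is a name I'm making up here, not anything Dask-specific):

```python
import os

def shared_path(name):
    # DATA_ROOT is a hypothetical variable name; set it to the
    # OS-specific root on each machine, e.g.
    #   workers (Linux):   DATA_ROOT=/mnt/share
    #   client (Windows):  DATA_ROOT=\\server\share
    # os.environ is read at call time, so the function returns the
    # right path on whichever machine actually runs it.
    return os.path.join(os.environ["DATA_ROOT"], name)
```

One caveat: as far as I know, dask.dataframe.read_csv resolves the path string on the Client and ships it as-is to the Workers, so this only helps if the variable is resolved on the machine that actually opens the file, for example inside a function you submit to the Workers.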
I don’t have any other simple suggestion; a more complex one would be to implement the data reading yourself: analyze the file on the Client side, and generate custom tasks that read it on the Workers using their own path…