I am following 01_dataframe.ipynb from dask-tutorial and want to use two separate computers for the job: one as the scheduler and the other as the worker.
I successfully connected the worker to the scheduler and ran the tutorial from the scheduler machine. The code is as follows:
The error shows: FileNotFoundError: [Errno 2] No such file or directory: '/media/lk/lksgcc/lk_git/1_BigData/dask-tutorial/data/nycflights/1999.csv'
This means the worker machine cannot find the data at that path.
My question is: how do I set the data location that the worker reads from in the tutorial?
The data needs to be accessible from both the Client and the Worker machines at the same location. So you’ll need some shared space between the Scheduler and Worker machines for this use case to work.
You can then pass this location to the read_csv function call.
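As a minimal sketch of what that could look like: the directory `/shared/dask-tutorial/data/nycflights` below is a hypothetical shared location (for example a network mount) and is not taken from the tutorial, and `csv_glob` and `load_flights` are illustrative helper names. The only Dask call used is `dd.read_csv`, which accepts a path or glob pattern as its first argument.

```python
import os

# Hypothetical shared directory (assumption): it must resolve to the same
# files, at the same absolute path, on the Client and on every Worker.
SHARED_DATA_DIR = "/shared/dask-tutorial/data/nycflights"


def csv_glob(data_dir):
    """Build a glob pattern matching every CSV file in data_dir."""
    return os.path.join(data_dir, "*.csv")


def load_flights(data_dir=SHARED_DATA_DIR):
    """Create a lazy Dask DataFrame over the CSV files in data_dir."""
    # Imported lazily so the path helper can be used without Dask installed.
    import dask.dataframe as dd

    # The location is the first argument to read_csv; the workers open
    # the files themselves, using exactly this path.
    return dd.read_csv(csv_glob(data_dir))
```

Calling `load_flights()` on the client only builds the task graph; the files are opened on the workers when the result is computed, which is why the path has to be valid there as well.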
Thank you @guillaumeeb
How can I set up the shared space and pass its location to the read_csv function call?
For example, both the Scheduler and the Worker machine have the data, but at different absolute paths. On the Scheduler the path is /media/lk/lksgcc/lk_git/1_BigData/dask-tutorial/data/nycflights. On the Worker machine it is /media/lk2/Data/dask-tutorial/data/nycflights. Because the two paths are not the same, the Worker still raises FileNotFoundError. Can you show me an example?
Thanks.
I guess you’ve seen it, but the location is the first argument to read_csv.
Unfortunately, there is no way I’m aware of to give a different path for the Client/Scheduler and the Workers. You’ve got to make sure every component of your Dask cluster sees the same path.
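One way to verify this before reading any data is to ask each worker whether the path exists. The sketch below is only illustrative: the scheduler address is a placeholder, and `workers_missing_path` and `check_path_on_workers` are hypothetical helper names. It relies on `Client.run`, which executes a function on every worker and returns a dictionary keyed by worker address.

```python
import os


def workers_missing_path(exists_by_worker):
    """Return the sorted addresses of workers where the path was not found."""
    return sorted(addr for addr, ok in exists_by_worker.items() if not ok)


def check_path_on_workers(scheduler_address, path):
    """Ask every worker whether `path` exists on its local filesystem."""
    # Imported lazily so the helper above can be used without Dask installed.
    from dask.distributed import Client

    with Client(scheduler_address) as client:
        # Client.run executes os.path.exists(path) on each worker and
        # returns {worker_address: result}.
        exists_by_worker = client.run(os.path.exists, path)
    return workers_missing_path(exists_by_worker)


if __name__ == "__main__":
    # Placeholder scheduler address (assumption); the path is the
    # Scheduler-side directory from the question.
    missing = check_path_on_workers(
        "tcp://scheduler-host:8786",
        "/media/lk/lksgcc/lk_git/1_BigData/dask-tutorial/data/nycflights",
    )
    print("Workers that cannot see the path:", missing)
```

If a worker shows up in the list, the path has to be made valid on that machine, for instance by mounting a shared filesystem at that location or by creating a symbolic link there that points to the local copy of the data.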