Problems reading .csv files

While reading a CSV in a Jupyter notebook:

import dask
import dask.dataframe as dd
df = dd.read_csv('bigdata.csv', blocksize=5e6)
df

runs fine, but if I then issue df.compute() or df.head() it raises an error:
FileNotFoundError: [Errno 2] No such file or directory: '…/bigdata.csv'

df.map_partitions(type) gives:

Dask Series Structure:
npartitions=9
    object
       ...
dtype: object

Hi @wayward, the reason things seem "fine" before you call df.compute() or df.head() is that, up to that point, nothing has actually run; Dask has only set up the task graph.
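
To make the laziness concrete, here is a minimal sketch (reusing the file name from your snippet and assuming it sits next to the notebook) of where the file is actually opened:

import dask.dataframe as dd

# read_csv only peeks at a small sample of the file to infer column dtypes,
# then records one "read this byte range" task per ~5 MB block.
df = dd.read_csv("bigdata.csv", blocksize=5e6)

# The blocks are only read when the graph is executed, so a path the workers
# cannot resolve surfaces as FileNotFoundError here, not in the line above.
print(df.head())        # runs just the first partition
result = df.compute()   # runs every partition and returns a pandas DataFrame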

It seems the problem you are having is that it can't find the file "bigdata.csv". If you only provide the file name, the file needs to be in the same location (path) as your Jupyter notebook for it to be read. If it is not, then you need to provide the relative path to where the file is. For example, with this layout:

your_folder/
        |_ jupyter_notebook.ipynb
        |_ another_folder/
                 |_ bigdata.csv

then the relative path to your file would be "another_folder/bigdata.csv".
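
As a rough sketch with that layout (paths are illustrative), the read from the notebook would then look like:

import os
import dask.dataframe as dd

# Path relative to the notebook's working directory in the layout above.
path = "another_folder/bigdata.csv"

# Quick sanity check that the notebook's process can actually see the file.
print(os.path.exists(path))

df = dd.read_csv(path, blocksize=5e6)
print(df.head())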


Hello, I was using remote workers that (obviously) could not access files local to the scheduler …


Hey @wayward! I think it was not clear from your snippet that you were using a cluster setup. What you’re looking for might be covered in the Futures docs, but if you could share a minimal reproducer or a bit more about your setup then we can troubleshoot further.
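
For context, a cluster setup where this mismatch shows up might look roughly like the following (the scheduler address is a placeholder); with a remote cluster, the path is resolved on the workers, not on the notebook's machine:

from dask.distributed import Client
import dask.dataframe as dd

# Connect to a remote scheduler (placeholder address); the workers behind it
# have their own filesystems, separate from the notebook's machine.
client = Client("tcp://scheduler-host:8786")

# This path is resolved on each worker when the graph runs, so a file that
# only exists next to the notebook raises FileNotFoundError on the workers.
df = dd.read_csv("bigdata.csv", blocksize=5e6)
df.head()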

@wayward Typically, if you are using a remote cluster, you will want your data to reside somewhere that all your workers can access. This could be a blob storage system like S3, a data warehouse, or some kind of network filesystem (see here for more information).
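
For example, a sketch of reading from S3 so that every worker resolves the same path could look like this (the bucket name is a placeholder, and it assumes s3fs is installed and credentials are available in the environment):

import dask.dataframe as dd

# Every worker fetches its blocks from the same object store, so it no longer
# matters which machine the notebook or scheduler runs on.
df = dd.read_csv(
    "s3://your-bucket/bigdata.csv",   # placeholder bucket/key
    blocksize=5e6,
    storage_options={"anon": False},  # use credentials from the environment
)
result = df.compute()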

As an aside, please be more considerate in the language you use in discussions here. Things that are obvious to you about your setup might not be obvious to others, and lots of people do use Dask on a single machine where all workers can see the same filesystem.
