Problems reading .csv files

While reading a CSV in a Jupyter notebook:

import dask
import dask.dataframe as dd
df = dd.read_csv('bigdata.csv', blocksize=5e6)
df

runs fine, but if I then issue df.compute() or df.head() it raises an error:
FileNotFoundError: [Errno 2] No such file or directory: '…/bigdata.csv'

df.map_partitions(type) gives:

Dask Series Structure:
npartitions=9
    object
       ...
dtype: object

Hi @wayward, the reason things seem "fine" before you call df.compute() or df.head() is that, up to that point, nothing has actually run; Dask has only set up the task graph.
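
To make the laziness concrete, here is a minimal sketch (reusing the file name from your snippet and assuming it sits next to the notebook) of where the file is actually opened:

import dask.dataframe as dd

# read_csv only peeks at a small sample of the file to infer column dtypes,
# then records one "read this byte range" task per ~5 MB block.
df = dd.read_csv("bigdata.csv", blocksize=5e6)

# The blocks are only read when the graph is executed, so a path the workers
# cannot resolve surfaces as FileNotFoundError here, not in the line above.
print(df.head())        # runs just the first partition
result = df.compute()   # runs every partition and returns a pandas DataFrame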

It seems the problem you are having is that it can't find the file "bigdata.csv". If you only provide the file name, the file needs to be in the same location (path) as your Jupyter notebook for it to be read. If it is not, then you need to provide the relative path to where the file is. For example, with this layout:

your_folder/
        |_ jupyter_notebook.ipynb
        |_ another_folder/
                 |_ bigdata.csv

then the relative path to your file would be "another_folder/bigdata.csv".
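
As a rough sketch with that layout (paths are illustrative), the read from the notebook would then look like:

import os
import dask.dataframe as dd

# Path relative to the notebook's working directory in the layout above.
path = "another_folder/bigdata.csv"

# Quick sanity check that the notebook's process can actually see the file.
print(os.path.exists(path))

df = dd.read_csv(path, blocksize=5e6)
print(df.head())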


Hello, I was using remote workers that (obviously) could not access files local to the scheduler …


Hey @wayward! I think it was not clear from your snippet that you were using a cluster setup. What you’re looking for might be covered in the Futures docs, but if you could share a minimal reproducer or a bit more about your setup then we can troubleshoot further.
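
For context, a cluster setup where this mismatch shows up might look roughly like the following (the scheduler address is a placeholder); with a remote cluster, the path is resolved on the workers, not on the notebook's machine:

from dask.distributed import Client
import dask.dataframe as dd

# Connect to a remote scheduler (placeholder address); the workers behind it
# have their own filesystems, separate from the notebook's machine.
client = Client("tcp://scheduler-host:8786")

# This path is resolved on each worker when the graph runs, so a file that
# only exists next to the notebook raises FileNotFoundError on the workers.
df = dd.read_csv("bigdata.csv", blocksize=5e6)
df.head()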

@wayward Typically, if you are using a remote cluster, you will want your data to reside somewhere that all your workers can access. This could be a blob storage system like S3, a data warehouse, or some kind of network filesystem (see here for more information).
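
For example, a sketch of reading from S3 so that every worker resolves the same path could look like this (the bucket name is a placeholder, and it assumes s3fs is installed and credentials are available in the environment):

import dask.dataframe as dd

# Every worker fetches its blocks from the same object store, so it no longer
# matters which machine the notebook or scheduler runs on.
df = dd.read_csv(
    "s3://your-bucket/bigdata.csv",   # placeholder bucket/key
    blocksize=5e6,
    storage_options={"anon": False},  # use credentials from the environment
)
result = df.compute()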

As an aside, please be more considerate in the language you use in discussions here. Things that are obvious to you about your setup might not be obvious to others, and lots of people do use Dask on a single machine where all workers can see the same filesystem.
