Unable to get head of a CSV read dask dataframe

Francois · March 13, 2023, 9:27am

Hi,

I am reposting here a question I posted on Stack Overflow.

Here is the failing code:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
import pandas as pd

# Write a small example CSV to disk
local_file = 'example.csv'
df0 = pd.DataFrame({'id':[0,1,3], 'model':['A', 'B', 'C']})
df0.to_csv(local_file)

if __name__ == '__main__':
    # Threads-only local cluster; both context managers close on exit
    with LocalCluster(processes=False) as cluster, Client(cluster) as client:
        df = dd.read_csv(local_file)
        print('df :')
        print(df.compute())  # prints a "Serialize" object instead of the data
        df.head()  # raises an error

Inside the LocalCluster/Client context, df.compute() does not return a pandas DataFrame with the values in it, but rather a “Serialize” graph.
df.head() then raises an error. Neither problem occurs outside the client.
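
To confirm what df.compute() actually hands back, it can help to print the fully qualified type of the result instead of the result itself. The following is only a minimal diagnostic sketch and not part of the failing code; qualified_type_name and is_from_package are hypothetical helper names, and the demo uses built-in values so it runs without Dask installed.

```python
def qualified_type_name(obj):
    """Return the module-qualified name of obj's type, e.g. 'builtins.list'."""
    cls = type(obj)
    return f"{cls.__module__}.{cls.__qualname__}"


def is_from_package(obj, package):
    """Return True if obj's type is defined in the given top-level package."""
    return type(obj).__module__.split(".")[0] == package


if __name__ == "__main__":
    # Demo with built-in values only.
    print(qualified_type_name([1, 2, 3]))       # builtins.list
    print(is_from_package([1, 2, 3], "pandas"))  # False

    # Assumed usage inside the failing script (not executed here):
    #     result = df.compute()
    #     print(qualified_type_name(result))
    #     print(is_from_package(result, "pandas"))
    # A pandas DataFrame reports 'pandas.core.frame.DataFrame'; any other
    # name would confirm that something else was returned.
```

Printing the type rather than the value makes the report easier to compare between the run inside the context managers and the run outside them.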

Can anyone fix this apparent bug?

guillaumeeb · March 13, 2023, 11:06am

Hi @Francois,

I saw your Stack Overflow question, but did not have time to really look at it. I have now reproduced the issue.

It seems to me that there is a problem with the LocalCluster or Client context manager, because just doing:

client = Client()
df = dd.read_csv(local_file)
print('df :')
print(df.compute())
df.head()

works as expected.

I think you should open an issue in the distributed GitHub repository.

Francois · March 13, 2023, 7:09pm

Thanks for your test and your advice. I’ll do as you suggest.
