How to work with distributed dataframes?

huangy6 · July 19, 2023, 4:53am

I’m having some trouble distributing computations on a ddf consisting of multiple CSVs.

Following this guide, I’m reading in the files using dd.read_csv() with glob notation and then reassigning df = client.persist(df). When I subsequently perform any computation on this df (e.g. df.head()), I trigger the following exception:

AttributeError: 'str' object has no attribute 'head'

If I evaluate df.compute(), instead of returning the full pandas DataFrame as expected, the result is a pd.Series of the read-csv task keys:

0       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 0)
1       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 1)
2       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 2)
3       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 3)
4       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 4)
                             ...                        
348    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
349    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
350    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
351    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
352    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
Length: 353, dtype: object

Is this a bug, or am I fundamentally misunderstanding something? I have tested with both my own dataset (a set of CSVs on my local drive) and the S3 bucket linked above, on both a LocalCluster and an LSFCluster, and got the same results.

huangy6 · July 19, 2023, 7:56pm

Resolved!

Turns out I was having deserialization problems caused by an outdated version of msgpack (1.0.3).
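
A hedged sketch of how one might inspect the msgpack version on each node of a cluster. The helper find_version_mismatches is hypothetical. It assumes the dictionary layout returned by distributed's Client.get_versions(), with "client", "scheduler", and "workers" entries that each carry a "packages" mapping. The thread does not say whether the versions differed between nodes or were simply old everywhere, so this only shows one way to look.

```python
def find_version_mismatches(versions, package="msgpack"):
    """Return {node: version} for every node whose `package` version differs from the client's.

    `versions` is assumed to have the layout returned by Client.get_versions().
    """
    client_version = versions["client"]["packages"].get(package)
    mismatches = {}

    scheduler_version = versions["scheduler"]["packages"].get(package)
    if scheduler_version != client_version:
        mismatches["scheduler"] = scheduler_version

    for address, info in versions["workers"].items():
        worker_version = info["packages"].get(package)
        if worker_version != client_version:
            mismatches[address] = worker_version

    return mismatches


# Usage against a live cluster (requires a connected distributed Client):
#     versions = client.get_versions()
#     print(versions["client"]["packages"].get("msgpack"))
#     print(find_version_mismatches(versions, "msgpack"))
```

An empty result only means every node matches the client; the shared version can still be outdated, so it is worth printing the client's version as well.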
