How to work with distributed dataframes?

I’m having some trouble distributing computations on a ddf consisting of multiple CSVs.

Following this guide, I’m reading in the files using dd.read_csv() with glob notation, and then reassigning df = client.persist(df). When I subsequently perform any computation on this df (e.g. df.head()), I trigger the following exception:

AttributeError: 'str' object has no attribute 'head'

If I evaluate df.compute(), instead of returning the full pandas dataframe as expected, the product is a pd.Series of the read-csv task keys:

0       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 0)
1       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 1)
2       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 2)
3       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 3)
4       ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 4)
                             ...
348    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
349    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
350    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
351    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
352    ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
Length: 353, dtype: object

Is this a bug, or am I fundamentally misunderstanding something? I have tested with both my own dataset (a set of CSVs on my local drive) and the S3 bucket linked above, on both a LocalCluster and an LSFCluster, and got the same results every time.


Turns out I was having deserialization problems caused by an outdated version of msgpack (1.0.3).
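
In case it helps anyone hitting the same thing, here is a rough sketch of how the installed versions could be checked. The helper names installed_versions and cluster_versions are my own, and cluster_versions assumes the nested dict layout ("client", "scheduler", "workers", each with a "packages" entry) that I believe distributed's Client.get_versions() returns.

```python
from importlib import metadata


def installed_versions(packages=("dask", "distributed", "msgpack")):
    """Return {package: version, or None if not installed} for this environment."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def cluster_versions(client, package="msgpack"):
    """Collect one package's version from the client, scheduler, and each worker.

    Assumes the dict layout returned by Client.get_versions().
    """
    info = client.get_versions()
    report = {
        "client": info["client"]["packages"].get(package),
        "scheduler": info["scheduler"]["packages"].get(package),
    }
    for address, worker in info["workers"].items():
        report[address] = worker["packages"].get(package)
    return report


if __name__ == "__main__":
    # Local check only; no cluster is needed for this part.
    print(installed_versions())
```

On a live cluster, cluster_versions(client) should show whether any process reports a different msgpack than the others. Calling client.get_versions(check=True) asks distributed to flag mismatches itself.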
