I’m having some trouble distributing computations on a Dask DataFrame built from multiple CSVs. Following this guide, I read the files in with dd.read_csv() using glob notation, then reassign df = client.persist(df). When I subsequently run any computation on this df (e.g. df.head()), I trigger the following exception:
AttributeError: 'str' object has no attribute 'head'
If I evaluate df.compute(), instead of returning the full pandas DataFrame as expected, the result is a pd.Series of the read-csv task keys:
0 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 0)
1 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 1)
2 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 2)
3 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 3)
4 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 4)
...
348 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
349 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
350 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
351 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
352 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
Length: 353, dtype: object
Is this a bug, or am I fundamentally misunderstanding something? I have tested with both my own dataset (a set of CSVs on my local drive) and the S3 bucket linked above, on both a LocalCluster and an LSFCluster, and received the same results each time.