I’m having some trouble distributing computations on a Dask DataFrame (ddf) made up of multiple CSVs.
Following this guide, I read the files in with dd.read_csv() using glob notation and then reassign df = client.persist(df). When I subsequently perform any computation on this df (e.g. df.head()), I trigger the following exception:
AttributeError: 'str' object has no attribute 'head'
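For reference, here is a minimal sketch of what I’m running (the data/*.csv glob is a placeholder for my actual file paths):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Start a local cluster and attach a client to it
cluster = LocalCluster()
client = Client(cluster)

# Read all the CSVs into one Dask DataFrame with a glob pattern
# ("data/*.csv" stands in for my actual paths)
df = dd.read_csv("data/*.csv")

# Persist the collection on the cluster and reassign the handle
df = client.persist(df)

# Any computation on the persisted frame now fails:
df.head()  # AttributeError: 'str' object has no attribute 'head'
```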
If I instead evaluate df.compute(), rather than returning the full pandas DataFrame as expected, it produces a pd.Series of the read-csv task keys:
0 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 0)
1 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 1)
2 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 2)
3 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 3)
4 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', 4)
...
348 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
349 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
350 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
351 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
352 ('read-csv-f7d3979c065ca498a266439f1dc9a7e9', ...
Length: 353, dtype: object
Is this a bug, or am I fundamentally misunderstanding something? I have tested with both my own dataset (a set of CSVs on my local drive) and the S3 bucket linked above, on both a LocalCluster and an LSFCluster, and got the same results in every case.