Hello,
I have a big dataset (2 TB), and for each partition I would like to compute some aggregations, merge them back into the partition itself, and save the result.
The script does something similar to:
import dask.dataframe as dd

def aggregate_merge(ddpartition):
    ddpartition_agg = ddpartition.groupby(...).agg(...)
    ddpartition = ddpartition.merge(ddpartition_agg, on=...)
    ddpartition.to_parquet(...)
    return None

dd_ = dd.read_csv(...)
dd_.map_partitions(aggregate_merge).compute()
I have a few questions:
- This computation filled up my memory. I was hoping that, by returning None each time, the workers would free the memory once they finished with a partition (or when memory was needed), but it keeps accumulating and the workers eventually get killed. Is there a better way to do this processing? (I sketch one idea I was considering after this list.)
- The map_partitions documentation says that “the index and divisions are assumed to remain unchanged”, so it is probably undesirable that my function aggregate_merge returns None. What other function can I use instead of map_partitions if it is not appropriate for what I would like to do? (See the second sketch below.)
- I have also tried:

  dd_ = dd.read_csv(...)
  dd_ = dd_.map_partitions(aggregate_merge)
  dd_.to_parquet('s3://...')

  but since the aggregation takes a lot of time, I get connection errors. Creating smaller partitions alleviates the problem a bit, but I still get connection errors. So, is there any way to set the connection timeout of s3fs? I am using pyarrow, but I cannot find where to set this parameter. (My attempt is in the last sketch below.)
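Regarding the first question, the only alternative I could think of (not sure it is actually better) is to go through to_delayed(), so that each partition is written inside its own delayed task and only None travels back to the scheduler. The write_one helper and the per-partition file name are just placeholders I made up:

import dask
import dask.dataframe as dd

def write_one(pdf, i):
    # inside the task, pdf is a plain pandas DataFrame
    pdf_agg = pdf.groupby(...).agg(...)
    pdf = pdf.merge(pdf_agg, on=...)
    # hypothetical per-partition output path
    pdf.to_parquet(f"s3://.../part-{i:05d}.parquet")
    return None

dd_ = dd.read_csv(...)
tasks = [dask.delayed(write_one)(part, i)
         for i, part in enumerate(dd_.to_delayed())]
dask.compute(*tasks)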
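Regarding the second question, I also wondered whether returning a tiny one-row summary together with an explicit meta= (instead of returning None) would be more acceptable for map_partitions. This is only a guess on my side:

import pandas as pd

def aggregate_merge(pdf):
    # ... same groupby/agg, merge and to_parquet as in my script above ...
    return pd.DataFrame({"rows_written": [len(pdf)]})

dd_.map_partitions(aggregate_merge, meta={"rows_written": "int64"}).compute()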
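For the third question, the closest thing I have found is passing storage_options down to s3fs when writing; as far as I understand, config_kwargs should end up in the botocore client configuration, but I am not sure these are the right parameters:

dd_.to_parquet(
    's3://...',
    engine='pyarrow',
    storage_options={
        # forwarded to s3fs.S3FileSystem; config_kwargs goes to the botocore config
        'config_kwargs': {
            'read_timeout': 600,
            'connect_timeout': 60,
            'retries': {'max_attempts': 10},
        },
    },
)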
Many thanks in advance for the help.