Map_partitions just to execute and save per partition

rdanger · September 28, 2022, 3:42pm

Hello,
I have a large dataset (2 TB). I would like to compute some aggregations, merge them back into the original data, and save each partition.

The script does something similar to:

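import dask.dataframe as dd

def aggregate_merge(ddpartition):
    # Aggregate within this partition only
    ddpartition_agg = ddpartition.groupby(...).agg(...)
    # Merge the aggregates back onto the partition's rows
    ddpartition = ddpartition.merge(ddpartition_agg, on=...)
    ddpartition.to_parquet(...)
    return None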

dd_ = dd.read_csv(...)
dd_.map_partitions(aggregate_merge).compute()
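To make the per-partition step concrete, here is a minimal sketch of the aggregate-then-merge logic on a single pandas DataFrame, which is what map_partitions hands to the function. The column names key, value and value_sum and the sum aggregation are assumptions for illustration only, not the real script, and the to_parquet call is left out so the sketch needs nothing but pandas.

```python
import pandas as pd


def aggregate_merge(partition: pd.DataFrame) -> pd.DataFrame:
    """Aggregate within one partition and merge the result back onto its rows."""
    # Hypothetical aggregation: per-key sum of "value", within this partition only.
    partition_agg = (
        partition.groupby("key", as_index=False)
        .agg(value_sum=("value", "sum"))
    )
    # A left merge keeps every original row and attaches the per-key aggregate.
    return partition.merge(partition_agg, on="key", how="left")


# A tiny stand-in for one partition (assumed data, not from the real dataset).
partition = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 5]})
merged = aggregate_merge(partition)
print(merged)
```

Note that the groupby here only sees the rows of one partition, so a key that is spread across several partitions gets a separate aggregate in each of them.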

I have a few questions:

This computation filled up my memory. I was hoping that, by returning None each time, the workers would free the memory once they finished with a partition (or when the memory was needed), but it keeps accumulating and the workers get killed in the end. Is there a better way to do this processing?
The map_partitions documentation says that “the index and divisions are assumed to remain unchanged.” So it is probably undesirable for my function aggregate_merge to return None. What other function can I use instead of map_partitions if it is not appropriate for what I would like to do?
I’ve also tried:

dd_ = dd.read_csv(...)
dd_ = dd_.map_partitions(aggregate_merge)
dd_.to_parquet('s3://...')

but because the aggregation takes a long time, I get connection errors. Creating smaller partitions alleviates the problem a bit, but I still get connection errors. So, is there any way I can set the connection timeout of s3fs? I am using pyarrow, but I cannot find a way to set this parameter.
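For reference, a sketch of the kind of setting being asked about, assuming the write goes through s3fs: as far as I understand, s3fs forwards a config_kwargs dictionary to botocore's client Config, which has connect_timeout and read_timeout options, and Dask's to_parquet accepts storage_options for the filesystem. The numbers below are placeholders, and whether this actually removes the connection errors is untested.

```python
# Assumed values; they would need tuning to how long one partition takes.
storage_options = {
    # s3fs passes config_kwargs on to botocore.client.Config
    "config_kwargs": {
        "connect_timeout": 60,  # seconds to wait when opening a connection
        "read_timeout": 600,    # seconds to wait for a response
        "retries": {"max_attempts": 10},
    }
}

# Hypothetical usage, with the bucket path elided as in the post:
# dd_.to_parquet("s3://...", storage_options=storage_options)
```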

Many thanks in advance for the help.
