My question is about the computation of Dask graphs. This is my current understanding: I read in multiple files and partition them by size, 50MB for example. Then I do drop_duplicates, groupbys and so on, and finally save the resulting partitions back to disk, one file per partition. This should prevent my process from running out of memory, because I am not calling compute() anywhere until the files are saved.
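To make that concrete, here is roughly what my pipeline looks like (a minimal sketch; the file paths, the id column and the exact sizes are just placeholders):

```python
import dask.dataframe as dd

# Read many files lazily, split into ~50MB partitions.
df = dd.read_csv("data/*.csv", blocksize="50MB")

# These calls only extend the task graph; nothing is executed yet.
deduped = df.drop_duplicates()
result = deduped.groupby("id").last()

# Repartition by size right before writing, then save one file per
# partition; writing is what finally triggers the computation.
result.repartition(partition_size="50MB").to_parquet("output/")
```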
What I am confused about is how this relates to the split parameters (split_every and split_out) found on a lot of these methods, for example groupby.last(). My .last() result is MOST of the time around the same size as the original data, but not ALWAYS, so I don't want to use a magic number like split_out=8: I don't want to create unnecessary partitions if the result is small enough, and if the data is reduced to fewer than 8 entries (which SHOULD not happen but CAN), this will even cause a runtime error. But will the process run out of memory if split_out is not specified and defaults to 1 partition? Does this even have an impact if I don't compute the data until it is saved to disk again (and repartitioned by size right before)?
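For reference, this is the kind of call I mean (the column name and numbers are made up):

```python
# Default: the whole .last() result lands in a single output partition.
result = df.groupby("id").last()

# split_out spreads the result over a fixed number of partitions;
# split_every controls how many input partitions are combined per step
# of the intermediate tree reduction.
result = df.groupby("id").last(split_out=8, split_every=4)
```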
I guess my question is: WHEN should you repartition at all? Does partitioning have a memory impact before anything is computed, while only lazy methods are being called? And why do these functions only offer a split_out parameter for the number of output partitions, and not a partition_size option like repartition() does?