When passing several Dask DataFrames to map_partitions(), it looks like the underlying align_partitions() forces repartitioning of every DataFrame, even if only a single one needs it.
import dask.datasets
import dask.dataframe

dds = [dask.datasets.timeseries() for i in range(5)]
# Only the last DataFrame's partitioning will differ
dds[-1] = dds[-1].repartition(npartitions=1)

def func(*args):
    return args[0]

dd = dask.dataframe.map_partitions(func, *dds)
dd.dask
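For reference, listing the layer names of the resulting high-level graph should make those extra layers visible (a quick sketch; the exact layer names depend on the Dask version):

# List the layers of the high-level graph; the repartition-related layers
# should show up for every input, not only for dds[-1].
for name in dd.dask.layers:
    print(name)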
That is correct, you can track that down in the source code.
However, even though you see the repartition-merge layer on all DataFrames, Dask is smart enough not to do useless work in the background. I’ve reduced the size of your example a bit:
import dask
import dask.datasets
import dask.dataframe as dd

dds = [dask.datasets.timeseries(end="2000-01-05") for i in range(3)]
# Only the last DataFrame's partitioning will differ
dds[-1] = dds[-1].repartition(npartitions=1)

def func(*args):
    return args[0]

ddf = dd.map_partitions(func, *dds)
This is the task graph generated when calling ddf.visualize():
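If you prefer a text check over the rendered image, a rough comparison of the raw graph with the graph after Dask's own optimization pass gives an idea of how much of that work is pass-through (a sketch; task counts and how much gets fused depend on the Dask version and configuration):

# Compare the number of keys in the raw high-level graph with the graph
# produced by Dask's default optimizations for this collection.
print("raw tasks:      ", len(ddf.__dask_graph__()))
(opt,) = dask.optimize(ddf)
print("optimized tasks:", len(opt.__dask_graph__()))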
I found this repartition(force=True) while investigating a very long graph packing/unpacking time before computation starts. I think it would help a lot to avoid any unnecessary graph nodes.
Should I create a PR to remove force=True? Do you see any advantage in forcing the repartition for every DataFrame? A better visualization, maybe? Is it worth inflating the graph size?
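To give a rough idea of the inflation I mean, something like this compares the number of graph keys when all inputs are already aligned versus when a single one is not (a sketch; exact counts depend on the Dask version):

import dask.datasets
import dask.dataframe

def keep_first(*args):  # same pass-through function as above
    return args[0]

aligned = [dask.datasets.timeseries(end="2000-01-05") for i in range(3)]
misaligned = aligned[:-1] + [aligned[-1].repartition(npartitions=1)]

same = dask.dataframe.map_partitions(keep_first, *aligned)
diff = dask.dataframe.map_partitions(keep_first, *misaligned)

print("all aligned:   ", len(same.__dask_graph__()), "graph keys")
print("one misaligned:", len(diff.__dask_graph__()), "graph keys")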
I think the force kwarg of repartition is unrelated to the align_partitions call, which happens no matter what.
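For context, force=True on repartition(divisions=...) only relaxes the check on the outer division bounds; it does not decide whether the alignment itself happens. A minimal illustration (assuming a small integer-indexed frame, unrelated to the map_partitions code path):

import pandas as pd
import dask.dataframe as dd

# Two partitions over a RangeIndex; divisions are typically (0, 5, 9).
d = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)
print(d.divisions)

d.repartition(divisions=[0, 3, 9])                # same outer bounds: no force needed
d.repartition(divisions=[-5, 3, 20], force=True)  # wider outer bounds: requires force=True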
At this point, if this is causing a problem in your use case, I think you should open an issue on the Dask GitHub to discuss it. Something like: “Only repartition the DataFrames that need it when passing several of them to map_partitions”. I’m not sure the solution is to just remove the force=True; the fix may belong elsewhere.
And I think in the end you’re right: if the task graph can be optimized, let’s try to do it!