I am wondering what the recommended way is to pass a modified function to map_partitions. I am working in Jupyter with dask version 2025.4.1. Here is an illustrative example:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

def myadd(df, a, b=1):
    return df.x + df.y + a + b

res = ddf.map_partitions(myadd, 1, b=2)
res.compute()
output:
0 5.0
1 7.0
2 9.0
3 11.0
4 13.0
dtype: float64
Updating myadd as follows:
def myadd(df, a, b=1):
    return df.x + df.y + a + b + 1

res = ddf.map_partitions(myadd, 1, b=2)
res.compute()
outputs:
0 5.0
1 7.0
2 9.0
3 11.0
4 13.0
dtype: float64
So the updated definition of myadd was not picked up, and the result is unchanged.
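For reference, here is a minimal pandas-only sketch (no dask involved) of what the updated myadd should return on the same data. It reuses the toy frame from the example above and calls the function directly; the name expected is only introduced here for illustration.

```python
import pandas as pd

# Same toy frame as in the example above.
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [1., 2., 3., 4., 5.]})

def myadd(df, a, b=1):
    # Updated definition, with the extra "+ 1".
    return df.x + df.y + a + b + 1

# Plain pandas call, bypassing map_partitions entirely.
expected = myadd(df, 1, b=2)
print(expected.tolist())  # [6.0, 8.0, 10.0, 12.0, 14.0]
```

These are the values I would expect from map_partitions after the update, since each row is x + y + 4.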
On the other hand, decorating the function with @staticmethod seems to work:
@staticmethod
def myadd(df, a, b=1):
    return df.x + df.y + a + b + 1

res = ddf.map_partitions(myadd, 1, b=2)
res.compute()
outputs:
0 6.0
1 8.0
2 10.0
3 12.0
4 14.0
dtype: float64
Is this the recommended way to do this?
I am fairly sure older dask versions were more robust and returned the “correct” output without the static definition.
In any case, I did not find any mention of this in the dask documentation or elsewhere on the web.
In my opinion, this behavior deserves more visibility.