Using DataFrame apply in a loop

@Ole Welcome to Discourse!

This is to be expected. Dask is computed lazily – it only creates the “task graph” or the logic of your computation during the for-loop. When the column values are calculated (at the very end after the for loop, when you call compute), the value of i is 3, and this value is used to calculate each column. That’s why all the columns 0 to 3 look alike.

However, it does evaluate the metadata, like the new column names immediately if possible. Hence, we have the correct column names.

Does this make sense?

This is a known limitation of parallel and distributed computation, and the best practice is to avoid global state.

However, if I do this outside of a loop

In this example, you’re using integers instead of a variable like i – that’s why you get correct results.

You can maybe use something like client.submit instead:

import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

client = Client()

numbers = range(0, 4)

ddf = dd.from_pandas(pd.DataFrame({"x": [0, 1, 2, 3]}), npartitions=1)


def func(i, s):
    return s.apply(lambda a: a if a == i else 99, meta=(str(i), "int64"))


for i in numbers:
    result = client.submit(func, i, ddf.x)
    ddf[str(i)] = result.result()

ddf.compute()
1 Like