@Ole Welcome to Discourse!
This is to be expected. Dask is computed lazily – it only creates the “task graph” or the logic of your computation during the for-loop. When the column values are calculated (at the very end after the for loop, when you call compute
), the value of i
is 3, and this value is used to calculate each column. That’s why all the columns 0 to 3 look alike.
However, it does evaluate the metadata, like the new column names immediately if possible. Hence, we have the correct column names.
Does this make sense?
This is a known limitation of parallel and distributed computation, and the best practice is to avoid global state.
However, if I do this outside of a loop
In this example, you’re using integers instead of a variable like i
– that’s why you get correct results.
You can maybe use something like client.submit
instead:
import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client
client = Client()
numbers = range(0, 4)
ddf = dd.from_pandas(pd.DataFrame({"x": [0, 1, 2, 3]}), npartitions=1)
def func(i, s):
return s.apply(lambda a: a if a == i else 99, meta=(str(i), "int64"))
for i in numbers:
result = client.submit(func, i, ddf.x)
ddf[str(i)] = result.result()
ddf.compute()