Using DataFrame apply in a loop

pavithraes · August 5, 2022, 7:33pm

@Ole Welcome to Discourse!

This is expected. Dask evaluates lazily: during the for-loop it only builds the “task graph”, that is, the description of your computation. The column values are calculated at the very end, after the for-loop, when you call compute. By then i is 3, and because the lambda looks up i when it is called rather than when it is defined, this value is used to calculate every column. That’s why the columns 0 to 3 all look alike.
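To make the mechanism concrete, here is a minimal plain-Python sketch of the same effect, with no Dask involved. The names build_deferred and late_results are only illustrative; the list of lambdas stands in for the lazily built task graph, and calling them afterwards stands in for compute.

```python
def build_deferred():
    """Define one function per loop step without running any of them yet."""
    deferred = []
    for i in range(4):
        # The lambda does not copy i; it looks i up when it is called.
        deferred.append(lambda a: a if a == i else 99)
    return deferred


values = [0, 1, 2, 3]

# By the time the lambdas run, the loop has finished and i is 3 for all of them.
late_results = [[f(a) for a in values] for f in build_deferred()]

for column in late_results:
    print(column)
```

All four printed rows come out identical, which mirrors the identical columns in the Dask result.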

However, Dask does evaluate metadata, such as the new column names, immediately where possible. That is why the column names are correct.

Does this make sense?

This is a known limitation of lazy, parallel and distributed computation, and the best practice is to avoid relying on global state, such as a loop variable that keeps changing after the task has been defined.
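In plain Python, one common way to avoid depending on the changing loop variable is to bind its current value when the function is created. The sketch below shows two standard options, a default argument and functools.partial. The helper keep_or_99 and the variable names are hypothetical, chosen only for this illustration.

```python
from functools import partial


def keep_or_99(a, i):
    """Return a if it equals i, otherwise the placeholder 99."""
    return a if a == i else 99


values = [0, 1, 2, 3]

# Option 1: a default argument is evaluated when the lambda is defined,
# so each lambda keeps the value i had in its own loop step.
by_default = [lambda a, i=i: a if a == i else 99 for i in range(4)]

# Option 2: functools.partial stores the current value of i explicitly.
by_partial = [partial(keep_or_99, i=i) for i in range(4)]

default_results = [[f(a) for a in values] for f in by_default]
partial_results = [[f(a) for a in values] for f in by_partial]

for column in default_results:
    print(column)
```

Each row now differs, because every function carries its own copy of i instead of reading a shared variable later.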

“However, if I do this outside of a loop”

In that example, you’re using integer literals instead of a variable like i, so the value each lambda compares against is fixed when the lambda is written. That’s why you get correct results.

You could use something like client.submit instead, passing i as an argument so that each call gets its own value:

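import dask.dataframe as dd
import pandas as pd
from dask.distributed import Client

# Start a local cluster and connect to it
client = Client()

numbers = range(0, 4)

ddf = dd.from_pandas(pd.DataFrame({"x": [0, 1, 2, 3]}), npartitions=1)


def func(i, s):
    # i is a function argument here, so the lambda uses the value passed to this call
    return s.apply(lambda a: a if a == i else 99, meta=(str(i), "int64"))


for i in numbers:
    # Submit func with the current value of i, then wait for the Series it returns
    result = client.submit(func, i, ddf.x)
    ddf[str(i)] = result.result()

ddf.compute()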
