Hi,
I am looking for a way to change the value of the last row of a particular column in a multi-column Dask dataframe.
The approaches below seem to work for a pandas DataFrame:
import pandas as pd
# Create a sample pandas DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Method 1: Using .loc
df.loc[df.index[-1], 'A'] = 10
# Method 2: Using .iloc
df.iloc[-1, df.columns.get_loc('A')] = 20
However, neither works with a Dask DataFrame:
import dask.dataframe as dd
# Create a sample Dask DataFrame
df = dd.from_pandas(pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}), npartitions=1)
# Method 1: Using .loc
df.loc[df.index[-1], 'A'] = 10
# Method 2: Using .iloc
df.iloc[-1, df.columns.get_loc('A')] = 20
Please help if you know of a way to achieve this.
Hi @Damilola,
Dask DataFrames (and all other Dask collections) are immutable objects, due to their distributed and resilient nature. You cannot simply change a particular value in place.
The only solution would be to create a new DataFrame from the input one, modifying only the last partition, but this will not be very efficient. Is there a good reason why you need to modify a single value?
If you really need something like this, the only thing that comes to my mind is using map_partitions with the special keyword partition_info inside your callable function:
Your map function gets information about where it is in the dataframe by accepting a special partition_info
keyword argument.
See dask.dataframe.DataFrame.map_partitions — Dask documentation.