How to sum the elements of a list column in a Dask Dataframe

Damilola · December 13, 2024, 8:49pm

Hi,

I am trying to sum the elements of a list column in a Dask Dataframe column.

import pandas as pd
import dask.dataframe as dd

# Create a Dask DataFrame
data = {'A': [[1, 2], [3, 4], [5, 6]]}
df = dd.from_pandas(pd.DataFrame(data), npartitions=1)

# Sum the list entries in column 'A'
df['sum'] = df['A'].apply(sum, meta=('sum', 'int64')) # meta is required for Dask

# Compute the result
result = df.compute()

print(result)

However, I am always getting the following error message

TypeError: unsupported operand type(s) for +: ‘int’ and ‘str’

In Pandas, the same operation can be done very easily with no errors

import pandas as pd

# Create a Dask DataFrame
data = {'A': [[1, 2], [3, 4], [5, 6]]}
df = pd.DataFrame(data)

# Sum the list entries in column 'A'
df['sum'] = df['A'].apply(sum)

Do you have any ideas or suggestions on how to fix this in Dask?

Thanks

martindurant · December 16, 2024, 2:20pm

You might be interested to look at akimbo, which is designed exactly for complex data types stored in normal dataframes. The code to get you want would be

import akimbo.dask
df.A.ak.sum(axis=-1)

Note, that this will not be particularly faster than map with sum, or delayed function with iteration/comprehensions, if your data starts off as python objects. BUT, if your data comes from a source which can directly make arrow vectorized data (e.g., parquet), then there is no conversion, and akimbo will tend to be much faster than pure python.

The origin of your exception, is that the data type is “object” in the original data, which dask interprets to mean “string” (which is indeed the most common case), so it is silently converting your lists to strings. I don’t know how to fix that, except to explicitly convert the data to arrow first (for which I would also use akimbo)

data = {"A": [[1, 2], [3, 4], [5, 6]] * 100000} # bigger data for benchmarking
df = dd.from_pandas(pd.DataFrame(data).ak.unpack(), npartitions=8)

In [49]: %timeit df.map(sum, meta="int64").compute()
370 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [52]: %timeit df.A.ak.sum(axis=-1).compute()
8.27 ms ± 85.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

martindurant · December 16, 2024, 2:20pm

Ref: akimbo documentation , GitHub - intake/akimbo: For when your data won't fit in your dataframe

Topic		Replies	Views
Meta='int' failed Dask DataFrame	1	220	January 15, 2022
Dask .to_parquet() errors when saving lists of integers (object types) with convert-string: False	1	1977	January 25, 2024
How to upload dataframe with numpy array column using to_parquet in dask.dataframe? Dask DataFrame	2	750	August 29, 2023
Using DataFrame apply in a loop Dask DataFrame	2	1201	August 5, 2022
Calculations in a column	0	292	October 26, 2022

How to sum the elements of a list column in a Dask Dataframe

Related topics