How to get the maximum value from dask dataframe column of list values

Damilola · April 19, 2025, 6:04am

Hi, I am trying to get the maximum value of each list in a Dask DataFrame column of lists, where some of the list elements can be None.

This is the logic I am trying:

  import dask.dataframe as dd
  import pandas as pd
  
  # Sample data
  data = {'list_column': [[1, 2, 3], [None, 5, 6], [7, None, 9], [None, None, None]]}
  df = pd.DataFrame(data)
  ddf = dd.from_pandas(df, npartitions=1)
  
  def safe_numeric_max(row):
      # Keep only numeric (int/float) values
      numerics = [x for x in row if isinstance(x, (int, float))]  
      return max(numerics) if numerics else None
  
  ddf['max_val'] = ddf['list_column'].map(safe_numeric_max, meta=('max_val', 'float64'))
  
  ddf.compute()

However, I always get None as the maximum value:

	list_column	        max_val
0	[1, 2, 3]	        None
1	[None, 5, 6]	    None
2	[7, None, 9]	    None
3	[None, None, None]	None

In pandas, the same function works as expected:

import pandas as pd

# Sample data
data = {'list_column': [[1, 2, 3], [None, 5, 6], [7, None, 9], [None, None, None]]}
df = pd.DataFrame(data)

def safe_numeric_max(row):
    # Keep only numeric (int/float) values
    numerics = [x for x in row if isinstance(x, (int, float))]  
    return max(numerics) if numerics else None

df['max_val'] = df['list_column'].map(safe_numeric_max)

print(df)

	list_column	        max_val
0	[1, 2, 3]	        3.0
1	[None, 5, 6]	    6.0
2	[7, None, 9]	    9.0
3	[None, None, None]	NaN

In plain Python, I also see no issue:

lst  = [7, None, 9]

numerics = [x for x in lst if isinstance(x, (int, float))]  

max(numerics)

Could you help me understand what I am doing wrong here, or suggest an alternative approach?

Thanks

guillaumeeb · April 25, 2025, 12:26pm

Hi @Damilola,

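Since PyArrow strings were introduced in Dask, the default behavior is to convert object-dtype columns, including columns of lists, to a PyArrow string type. Your function then receives each row as a string rather than a list, so the isinstance filter finds no numbers and returns None. You just need to disable this behavior before creating the Dask DataFrame: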

import dask.dataframe as dd
from dask import config

# df and safe_numeric_max are defined as in the question above
config.set({"dataframe.convert-string": False})
ddf = dd.from_pandas(df, npartitions=1)
ddf['max_val'] = ddf['list_column'].map(safe_numeric_max, meta=('max_val', 'float64'))
ddf.compute()
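
To illustrate why the string conversion leads to None, here is a minimal plain-Python sketch that does not need Dask. It assumes the converted value looks like `str()` of the original list, which is my assumption about the exact text; the point holds for any string, because iterating over a string yields single characters and none of them is an int or float.

```python
def safe_numeric_max(row):
    # Same function as in the question: keep only int/float values
    numerics = [x for x in row if isinstance(x, (int, float))]
    return max(numerics) if numerics else None

as_list = [7, None, 9]
# Assumption: a stand-in for what the function sees after string conversion
as_text = str(as_list)

result_list = safe_numeric_max(as_list)  # iterates over 7, None, 9
result_text = safe_numeric_max(as_text)  # iterates over '[', '7', ',', ...

print(result_list)
print(result_text)
```

The first call returns 9, while the second returns None because every element it inspects is a one-character string.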

Damilola · April 27, 2025, 7:44pm

Thanks for the response and the suggestion.

With your suggestion, I now have it working:

import dask
import dask.dataframe as dd
import pandas as pd

dask.config.set({"dataframe.convert-string": False})

# Sample data
data = {'list_column': [[1, 2, 3], [None, 5, 6], [7, None, 9], [None, None, None]]}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=1)

def safe_numeric_max(row):
    # Keep only numeric (int/float) values
    numerics = [x for x in row if isinstance(x, (int, float))]  
    return max(numerics) if numerics else None

ddf['max_val'] = ddf['list_column'].map(safe_numeric_max, meta=('max_val', 'float64'))

ddf.compute()

	list_column	        max_val
0	[1, 2, 3]	        3.0
1	[None, 5, 6]	    6.0
2	[7, None, 9]	    9.0
3	[None, None, None]	NaN
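
For cases where the column has already been turned into strings and the config option cannot be changed, one possible alternative is to parse each value back into a list before taking the maximum. This is only a sketch under assumptions: `safe_numeric_max_any` is a hypothetical helper name, and it assumes the strings are valid Python list literals such as "[None, 5, 6]".

```python
import ast

def safe_numeric_max_any(value):
    """Hypothetical helper: accepts a list or its string representation."""
    if isinstance(value, str):
        try:
            # Assumption: the string is a Python literal like "[None, 5, 6]"
            value = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return None
    if not isinstance(value, (list, tuple)):
        return None
    # Keep only numeric (int/float) values, as in the original function
    numerics = [x for x in value if isinstance(x, (int, float))]
    return max(numerics) if numerics else None

print(safe_numeric_max_any([7, None, 9]))
print(safe_numeric_max_any("[None, 5, 6]"))
print(safe_numeric_max_any("[None, None, None]"))
```

It could be passed to `map` in place of `safe_numeric_max`, with the same `meta` as above. Disabling `dataframe.convert-string` remains the simpler fix when that is an option.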
