How to get the maximum value from dask dataframe column of list values

Hi, I am trying to get the maximum value of each list in a Dask DataFrame column, where some of the list elements can be None.

I am using the logic below:

  import dask.dataframe as dd
  import pandas as pd
  
  # Sample data
  data = {'list_column': [[1, 2, 3], [None, 5, 6], [7, None, 9], [None, None, None]]}
  df = pd.DataFrame(data)
  ddf = dd.from_pandas(df, npartitions=1)
  
  def safe_numeric_max(row):
      # Keep only numeric (int/float) values
      numerics = [x for x in row if isinstance(x, (int, float))]  
      return max(numerics) if numerics else None
  
  ddf['max_val'] = ddf['list_column'].map(safe_numeric_max, meta=('max_val', 'float64'))
  
  ddf.compute()

However, I always get None as the maximum value:

	list_column	        max_val
0	[1, 2, 3]	        None
1	[None, 5, 6]	    None
2	[7, None, 9]	    None
3	[None, None, None]	None

In pandas, the same logic works as expected:

import pandas as pd

# Sample data
data = {'list_column': [[1, 2, 3], [None, 5, 6], [7, None, 9], [None, None, None]]}
df = pd.DataFrame(data)

def safe_numeric_max(row):
    # Keep only numeric (int/float) values
    numerics = [x for x in row if isinstance(x, (int, float))]  
    return max(numerics) if numerics else None

df['max_val'] = df['list_column'].map(safe_numeric_max)

print(df)
	list_column	        max_val
0	[1, 2, 3]	        3.0
1	[None, 5, 6]	    6.0
2	[7, None, 9]	    9.0
3	[None, None, None]	NaN

In plain Python, I also see no issue:

lst  = [7, None, 9]

numerics = [x for x in lst if isinstance(x, (int, float))]  

max(numerics)
9

Can you please tell me what I might be doing wrong here, or suggest an alternative approach?

Thanks

Hi @Damilola,

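Since Dask started using PyArrow-backed strings by default, object-dtype columns (including columns of lists) are converted to the string[pyarrow] dtype when PyArrow is installed. Your function then receives a string instead of a list, so no element passes the isinstance check and it returns None. You just need to disable this behavior before creating the Dask DataFrame: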

from dask import config
config.set({"dataframe.convert-string": False})
ddf = dd.from_pandas(df, npartitions=1)
ddf['max_val'] = ddf['list_column'].map(safe_numeric_max, meta=('max_val', 'float64'))
ddf.compute()
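
To see why the string conversion yields None for every row, here is a minimal plain-Python sketch. It assumes the converted value is simply the string representation of the list (str(as_list) is used as a stand-in, not as the exact value Dask produces); the names as_list and as_string are illustrative only.

```python
def safe_numeric_max(row):
    # Keep only numeric (int/float) values
    numerics = [x for x in row if isinstance(x, (int, float))]
    return max(numerics) if numerics else None

as_list = [7, None, 9]
# Assumption: stand-in for what the function sees after the string conversion
as_string = str(as_list)

# Iterating a list yields its elements, so the numeric filter works
result_from_list = safe_numeric_max(as_list)

# Iterating a string yields single characters, none of which is an int or float,
# so the filtered list is empty and the function falls back to None
result_from_string = safe_numeric_max(as_string)

print(result_from_list)
print(result_from_string)
```

The function itself is fine; it just never sees a list while the conversion is active.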

Thanks for the response and the suggestion.

With your suggestion, it now works:

import dask
import dask.dataframe as dd
import pandas as pd

dask.config.set({"dataframe.convert-string": False})

# Sample data
data = {'list_column': [[1, 2, 3], [None, 5, 6], [7, None, 9], [None, None, None]]}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=1)

def safe_numeric_max(row):
    # Keep only numeric (int/float) values
    numerics = [x for x in row if isinstance(x, (int, float))]  
    return max(numerics) if numerics else None

ddf['max_val'] = ddf['list_column'].map(safe_numeric_max, meta=('max_val', 'float64'))

ddf.compute()

    list_column	        max_val
0	[1, 2, 3]	        3.0
1	[None, 5, 6]	    6.0
2	[7, None, 9]	    9.0
3	[None, None, None]	NaN
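
As an optional safeguard, the mapped function could refuse string input so that a silent conversion like the one above fails loudly instead of returning None everywhere. This is only a sketch: strict_numeric_max is a hypothetical name, not part of Dask or pandas, and it otherwise keeps the same filtering logic as safe_numeric_max.

```python
def strict_numeric_max(row):
    # Hypothetical variant of safe_numeric_max: fail fast if the row
    # arrives as a string rather than a list
    if isinstance(row, str):
        raise TypeError(f"expected a list, got a string: {row!r}")
    # Keep only numeric (int/float) values
    numerics = [x for x in row if isinstance(x, (int, float))]
    return max(numerics) if numerics else None

print(strict_numeric_max([None, 5, 6]))
print(strict_numeric_max([None, None, None]))
```

With this check, mapping over a column that has been converted to strings would raise a TypeError rather than produce a column full of None.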