Median with masked array behaves unexpectedly

I have encountered an issue but I would like to have an informal opinion before adding it to GitHub (maybe it’s expected behavior)it has to do with percentiles and masked arrays. Here is the gist https://gist.github.com/Campostrini/fbed9e033ab26b90fb4c885c10419dcb you need py38 for the self documenting strings

import dask.array as da
import numpy as np
    
r1 = list(range(9))
r2 = list(range(8))
r2.append(np.nan)

dar1 = da.from_array(r1)
dar2 = da.from_array(r2)

masked_dar1 = da.ma.masked_where(dar1 > 7, dar1)
masked_dar2 = da.ma.masked_where(da.isnan(dar2), dar2)

median1 = da.median(dar1, axis=0).compute()
median2 = da.median(dar2, axis=0).compute()

median1_masked = da.median(masked_dar1, axis=0).compute()
median2_masked = da.median(masked_dar2, axis=0).compute()

median1_p50 = da.percentile(dar1, 50).compute()
median2_p50 = da.percentile(dar2, 50).compute()

median1_p50_masked = da.percentile(masked_dar1, 50).compute()
median2_p50_masked = da.percentile(masked_dar2, 50).compute()

print(f"{dar1.compute()=} \n{dar2.compute()=} \n{masked_dar1.compute()=} \n{masked_dar2.compute()=}")
print(f"{median1=} {median2=} {median1_masked=} {median2_masked=} {median1_p50=} {median2_p50=}")
print(f"{median1_p50_masked=} {median2_p50_masked=}")

# expected behaviour
r3 = list(range(8))
dar3 = da.from_array(r3)
expected_median = da.median(dar3, axis=0).compute()
expected_median_p50 = da.percentile(dar3, 50).compute()

print(f"Expectations with masks: {expected_median=} {expected_median_p50=}")

# how to get around
alternative_median = da.median(dar1[~da.ma.getmaskarray(masked_dar1)], axis=0).compute()
print(f"Pseudo fix: {alternative_median=}")

Hi @Campostrini, thanks for the snippet and for moving your question here.

Can you elaborate a bit more on what you are trying to do? There’s a lot going on in your gist, and some more context would be helpful.

As I understand, you are getting surprising results when dealing with masked Dask arrays. The first thing I’d note is that a lot of the same behavior happens if you use regular NumPy arrays. For example:

import numpy as np
    
r1 = list(range(9))
r2 = list(range(8))
r2.append(np.nan)
masked_r1 = np.ma.masked_where(np.array(r1) > 7, r1)
masked_r2 = np.ma.masked_where(np.isnan(r2), r2)
print(np.median(r1))
print(np.median(r2))

produces 4.0 and nan, when we wanted 3.5. It also produces a couple of warnings that the mask is being ignored. However, numpy has masked-array-specific versions of many functions, so the following produces the expected result:

print(np.ma.median(r1))
print(np.ma.median(r2))

I’d also direct you towards versions of median and percentile that explicitly ignore nans: np.nanmedian and np.nanpercentile.

Unfortunately, Dask’s coverage of the numpy APIs is incomplete, so not all of the above have analogues. For instance, da.nanmedian exists, but da.nanpercentile does not. da.ma.average has been implemented, but da.ma.median has not. This is where having a bit more context about what you are trying to accomplish would be helpful (does one of the existing methods suit your need, or is it missing?). If what you want is a median along an axis, taking into account nans, the following may be enough:

print(da.nanmedian(dar2, axis=0).compute())

But if you need full support for percentiles with arbitrary chunking and masked arrays, there may need to be more work.

Also, a warning about Dask’s median implementation: it only works by having chunks fully contain the axis along which you are computing the median (that is to say, each median can be computed locally and doesn’t need data on other dask workers). See this issue for more discussion.

2 Likes