Median with masked array behaves unexpectedly

Campostrini · November 13, 2021, 3:32pm

I have encountered an issue but I would like to have an informal opinion before adding it to GitHub (maybe it’s expected behavior)it has to do with percentiles and masked arrays. Here is the gist https://gist.github.com/Campostrini/fbed9e033ab26b90fb4c885c10419dcb you need py38 for the self documenting strings

import dask.array as da
import numpy as np
    
r1 = list(range(9))
r2 = list(range(8))
r2.append(np.nan)

dar1 = da.from_array(r1)
dar2 = da.from_array(r2)

masked_dar1 = da.ma.masked_where(dar1 > 7, dar1)
masked_dar2 = da.ma.masked_where(da.isnan(dar2), dar2)

median1 = da.median(dar1, axis=0).compute()
median2 = da.median(dar2, axis=0).compute()

median1_masked = da.median(masked_dar1, axis=0).compute()
median2_masked = da.median(masked_dar2, axis=0).compute()

median1_p50 = da.percentile(dar1, 50).compute()
median2_p50 = da.percentile(dar2, 50).compute()

median1_p50_masked = da.percentile(masked_dar1, 50).compute()
median2_p50_masked = da.percentile(masked_dar2, 50).compute()

print(f"{dar1.compute()=} \n{dar2.compute()=} \n{masked_dar1.compute()=} \n{masked_dar2.compute()=}")
print(f"{median1=} {median2=} {median1_masked=} {median2_masked=} {median1_p50=} {median2_p50=}")
print(f"{median1_p50_masked=} {median2_p50_masked=}")

# expected behaviour
r3 = list(range(8))
dar3 = da.from_array(r3)
expected_median = da.median(dar3, axis=0).compute()
expected_median_p50 = da.percentile(dar3, 50).compute()

print(f"Expectations with masks: {expected_median=} {expected_median_p50=}")

# how to get around
alternative_median = da.median(dar1[~da.ma.getmaskarray(masked_dar1)], axis=0).compute()
print(f"Pseudo fix: {alternative_median=}")

ian · November 16, 2021, 8:25pm

Hi @Campostrini, thanks for the snippet and for moving your question here.

Can you elaborate a bit more on what you are trying to do? There’s a lot going on in your gist, and some more context would be helpful.

As I understand, you are getting surprising results when dealing with masked Dask arrays. The first thing I’d note is that a lot of the same behavior happens if you use regular NumPy arrays. For example:

import numpy as np
    
r1 = list(range(9))
r2 = list(range(8))
r2.append(np.nan)
masked_r1 = np.ma.masked_where(np.array(r1) > 7, r1)
masked_r2 = np.ma.masked_where(np.isnan(r2), r2)
print(np.median(r1))
print(np.median(r2))

produces 4.0 and nan, when we wanted 3.5. It also produces a couple of warnings that the mask is being ignored. However, numpy has masked-array-specific versions of many functions, so the following produces the expected result:

print(np.ma.median(r1))
print(np.ma.median(r2))

I’d also direct you towards versions of median and percentile that explicitly ignore nans: np.nanmedian and np.nanpercentile.

Unfortunately, Dask’s coverage of the numpy APIs is incomplete, so not all of the above have analogues. For instance, da.nanmedian exists, but da.nanpercentile does not. da.ma.average has been implemented, but da.ma.median has not. This is where having a bit more context about what you are trying to accomplish would be helpful (does one of the existing methods suit your need, or is it missing?). If what you want is a median along an axis, taking into account nans, the following may be enough:

print(da.nanmedian(dar2, axis=0).compute())

But if you need full support for percentiles with arbitrary chunking and masked arrays, there may need to be more work.

Also, a warning about Dask’s median implementation: it only works by having chunks fully contain the axis along which you are computing the median (that is to say, each median can be computed locally and doesn’t need data on other dask workers). See this issue for more discussion.

Topic		Replies	Views
Da.random.permutation on masked array Dask Array dask-array	6	609	January 19, 2022
Constructing a sparse dask array from Numpy arrays Dask Array	1	184	September 7, 2023
Help with dask/zarr usage -- performance issues with dask Dask Array zarr	10	1923	January 17, 2023
Indexing a dask array with a boolean array Dask Array dask-array	2	1717	May 19, 2022
Dask array, twice delayed Dask Array dask-array , distributed	6	788	February 23, 2022

Median with masked array behaves unexpectedly

Related topics