Sampling from Dataframe for frac < npartitions / nrows

Hey there,

I have been trying to sample rows from a dataframe using sample(frac) and realised that it only works properly if the parameter frac >= npartitions / nrows.

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)
print('Expectation | Result')
print('--------------------')
for numerator in (1, 5, 9, 10, 20, 30, 50):
    sample = df.sample(frac=numerator/100)
    print(f'{numerator:^11} | {len(sample):^7}')

Is this behaviour intended and if so, is there a work around for the case frac < npartitions / nrows?

Hi @sissnad and welcome! Thanks for taking the time to provide a minimal reproducer, this was very easy to investigate!

This seems to be expected behavior, though it is not well documented. For comparison, pandas also returns an empty dataframe whenever frac is less than 1 divided by the number of rows:

import pandas as pd
# returns an empty dataframe for any value of frac < 1/100
pd.DataFrame({'test':range(0,100)}).sample(frac=1/1000)

Since sample operates on each partition, it will similarly return an empty dataframe if frac * nrows is less than 1, per partition. I encourage you to open up an issue, improving the docstring for this function would be very helpful!

Hi @scharlottej13, thanks for your reply.

I think your pandas example is not a one-to-one comparison. Sampling with frac = 1/1000 from a dataframe with 100 rows should return an empty sample. The smallest possible non-empty sample size is 1, corresponding to frac = 1/100, and in that case pandas correctly returns a sample of size 1:

import pandas as pd
# returns a sample of size 1, since frac = 1/100 of 100 rows is 1 row
pd.DataFrame({'test': range(0, 100)}).sample(frac=1/100)

To me it would be unnatural and confusing if frac in dask referred to:

  • the fraction per partition for frac = n / nrows for n < npartitions
  • and to the entire dataframe for frac = n / nrows for n >= npartitions.

I haven’t looked into the code yet. But in my opinion, if I sample with frac = n / nrows for n < npartitions, and the process in dask requires choosing at least 1 sample per partition, then sample(frac=n/nrows) has to randomly drop npartitions - n samples.

In any case, I am happy to open up an issue.

Hi @sissnad! Ah yes, my apologies with the pandas example, I see your point.

Here is the implementation of sample. We can also use visualize to get a better sense of what’s happening, which shows sample being called in parallel on each partition of the dataframe:

import pandas as pd
import dask.dataframe as dd
df = dd.from_pandas(pd.DataFrame({'test':range(0,100)}), npartitions=10)
df.sample(frac=1/100).visualize()

In any case, I agree that sample is not behaving as you’d expect and thanks in advance for opening up an issue on this!


Thanks @sissnad for opening up an issue!
