I have been trying to sample rows from a dataframe using sample(frac) and realised that it only works properly if the parameter frac >= npartitions / nrows.
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)

print('Expectation | Result')
print('--------------------')
for numerator in (1, 5, 9, 10, 20, 30, 50):
    sample = df.sample(frac=numerator/100)
    print(f'{numerator:^11} | {len(sample):^7}')
Is this behaviour intended, and if so, is there a workaround for the case frac < npartitions / nrows?
Hi @sissnad and welcome! Thanks for taking the time to provide a minimal reproducer; it made this very easy to investigate!
This seems to be expected behavior, though it is not well documented. For comparison, in pandas, an empty dataframe is returned whenever frac multiplied by the number of rows rounds to zero:
import pandas as pd

# frac=1/1000 of 100 rows rounds to 0 rows, so an empty dataframe is returned
pd.DataFrame({'test': range(0, 100)}).sample(frac=1/1000)
Since sample operates independently on each partition, each partition will similarly return an empty dataframe if frac multiplied by that partition's row count rounds to zero. I encourage you to open up an issue; improving the docstring for this function would be very helpful!
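To illustrate, here is a minimal sketch of the same per-partition behavior, using map_partitions to apply pandas' sample to each partition directly:

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)

# Each of the 10 partitions holds 10 rows, so frac=0.05 asks each
# partition for 0.05 * 10 = 0.5 rows, which rounds to 0 rows.
per_partition = df.map_partitions(lambda part: part.sample(frac=0.05))
print(len(per_partition))  # 0 -- every partition returns an empty sample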
I think your pandas example does not compare one to one. Sampling with frac = 1/1000 from a dataframe with 100 rows should return an empty sample. The smallest possible non-empty sample size is 1, corresponding to frac = 1/100, and in that case pandas correctly returns a sample of size 1:
import pandas as pd

# frac=1/100 of 100 rows is exactly 1 row, so a single-row sample is returned
pd.DataFrame({'test': range(0, 100)}).sample(frac=1/100)
To me it would be unnatural and confusing if frac in dask referred to:
- the fraction per partition for frac = n / nrows with n < npartitions, and
- the fraction of the entire dataframe for frac = n / nrows with n >= npartitions.
I haven’t looked into the code yet, but in my opinion, if I sample with frac = n / nrows for n < npartitions, and the process in dask requires choosing at least 1 sample per partition, then sample(frac = n/nrows) has to randomly drop npartitions - n samples.
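For anyone hitting this in the meantime, here is a minimal sketch of that idea as a workaround (not part of the dask API): oversample one row per partition, then randomly thin the result down to the target size.

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)

n_target = 5  # desired global sample size, i.e. frac = 5/100

# Take at least one row from every partition (min() keeps empty
# partitions from raising), then randomly drop the surplus rows.
oversampled = df.map_partitions(lambda part: part.sample(n=min(1, len(part)))).compute()
final = oversampled.sample(n=n_target)
print(len(final))  # 5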
Hi @sissnad! Ah yes, apologies for the pandas example; I see your point.
Here is the implementation of sample. We can also use visualize to get a better sense of what’s happening; it shows sample being called in parallel on each partition of the dataframe:
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)
df.sample(frac=1/100).visualize()
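If graphviz is not installed, a quick alternative (just a sketch, not required) is to inspect the task graph keys directly; one sample task shows up per input partition:

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)
sampled = df.sample(frac=1/100)

# Task keys are (layer_name, partition_index) tuples; the 'sample-*'
# layer has one task for each of the 10 input partitions.
print(sorted({key[0] for key in sampled.__dask_graph__()}))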