I have been trying to sample rows from a dataframe using sample(frac) and realised that it only works properly if the parameter frac >= npartitions / nrows.
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)

print('Expectation | Result')
print('--------------------')
for numerator in (1, 5, 9, 10, 20, 30, 50):
    sample = df.sample(frac=numerator/100)
    print(f'{numerator:^11} | {len(sample):^7}')
Is this behaviour intended, and if so, is there a workaround for the case frac < npartitions / nrows?
Hi @sissnad and welcome! Thanks for taking the time to provide a minimal reproducer; it made this very easy to investigate!
This seems to be expected behavior, though it is not well documented. For comparison, in pandas, an empty dataframe is returned whenever frac multiplied by the number of rows rounds to zero:
import pandas as pd

# frac=1/1000 of 100 rows rounds to 0 rows, so an empty dataframe is returned
pd.DataFrame({'test': range(0, 100)}).sample(frac=1/1000)
Since sample operates independently on each partition, each partition will similarly return an empty dataframe if frac multiplied by that partition's row count rounds to zero. I encourage you to open up an issue; improving the docstring for this function would be very helpful!
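To illustrate, here is a minimal sketch of the same per-partition behavior, using map_partitions to apply pandas' sample to each partition directly:

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)

# Each of the 10 partitions holds 10 rows, so frac=0.05 asks each
# partition for 0.05 * 10 = 0.5 rows, which rounds to 0 rows.
per_partition = df.map_partitions(lambda part: part.sample(frac=0.05))
print(len(per_partition))  # 0 -- every partition returns an empty sample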
I think your pandas example does not compare one to one. Sampling with frac = 1/1000 from a dataframe with 100 rows should return an empty sample. The smallest possible non-empty sample size is 1, corresponding to frac = 1/100, and in that case pandas correctly returns a sample of size 1:
import pandas as pd

# frac=1/100 of 100 rows is exactly 1 row, so a single-row sample is returned
pd.DataFrame({'test': range(0, 100)}).sample(frac=1/100)
To me it would be unnatural and confusing if frac in dask referred to:
- the fraction per partition for frac = n / nrows with n < npartitions, and
- the fraction of the entire dataframe for frac = n / nrows with n >= npartitions.
I haven’t looked into the code yet, but in my opinion, if I sample with frac = n / nrows for n < npartitions, and the process in dask requires choosing at least 1 sample per partition, then sample(frac = n/nrows) has to randomly drop npartitions - n samples.
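For anyone hitting this in the meantime, here is a minimal sketch of that idea as a workaround (not part of the dask API): oversample one row per partition, then randomly thin the result down to the target size.

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)

n_target = 5  # desired global sample size, i.e. frac = 5/100

# Take at least one row from every partition (min() keeps empty
# partitions from raising), then randomly drop the surplus rows.
oversampled = df.map_partitions(lambda part: part.sample(n=min(1, len(part)))).compute()
final = oversampled.sample(n=n_target)
print(len(final))  # 5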
Hi @sissnad! Ah yes, apologies for the pandas example; I see your point.
Here is the implementation of sample. We can also use visualize to get a better sense of what’s happening; it shows sample being called in parallel on each partition of the dataframe:
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)
df.sample(frac=1/100).visualize()
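If graphviz is not installed, a quick alternative (just a sketch, not required) is to inspect the task graph keys directly; one sample task shows up per input partition:

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'test': range(0, 100)}), npartitions=10)
sampled = df.sample(frac=1/100)

# Task keys are (layer_name, partition_index) tuples; the 'sample-*'
# layer has one task for each of the 10 input partitions.
print(sorted({key[0] for key in sampled.__dask_graph__()}))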