Distributed dask dataframe sample reproducibility

Hello, everyone.

What’s the best way to get reproducible pseudo-random results in a distributed setting?

How does dask dataframe sampling work when random_state is set? Will it always give the same result for the same dataframe and random_state? Does each partition use its own “seed”? What happens when the number of partitions is different?

Thanks for your help

Milton

Hi @miltava,

Could you be a bit more precise about what you are trying to achieve? Which DataFrame function are you referring to? Could you share some sample code?

Sorry, I did not pay attention to the word "sample" in the title, so I guess you are talking about the DataFrame.sample function.

In this case, according to the code, if you have the same DataFrame, meaning the same data and the same partitions (number and shape), then sample will always give the same result for a given random_state. The random_state is used to create a separate sub-state for each partition (https://github.com/dask/dask/blob/main/dask/utils.py#L410).

If the number of partitions is different, then you probably won’t have the same result.
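
To illustrate the per-partition seeding, here is a minimal sketch in plain Python. It is a simplified model and not dask's actual implementation: the helpers partition_seeds, split and sample are hypothetical names made up for this example, and the way child seeds are derived from the master seed is an assumption.

```python
import random


def partition_seeds(n_partitions, random_state):
    # Hypothetical helper: derive one child seed per partition from the master seed.
    master = random.Random(random_state)
    return [master.randrange(2**32) for _ in range(n_partitions)]


def split(rows, n_partitions):
    # Hypothetical helper: cut rows into contiguous chunks of near-equal size.
    size, extra = divmod(len(rows), n_partitions)
    parts, start = [], 0
    for i in range(n_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(rows[start:end])
        start = end
    return parts


def sample(rows, n_partitions, frac, random_state):
    # Sample each partition independently with its own seeded generator,
    # then concatenate the per-partition results.
    parts = split(rows, n_partitions)
    seeds = partition_seeds(n_partitions, random_state)
    out = []
    for part, seed in zip(parts, seeds):
        rng = random.Random(seed)
        out.extend(rng.sample(part, round(len(part) * frac)))
    return out


rows = list(range(120))
a = sample(rows, n_partitions=4, frac=0.5, random_state=42)
b = sample(rows, n_partitions=4, frac=0.5, random_state=42)
c = sample(rows, n_partitions=6, frac=0.5, random_state=42)

print(a == b)  # True: same data, same partitioning, same random_state
print(a == c)  # typically False: the partitions and their seeds no longer line up
```

With a real dask DataFrame, the corresponding check would be to call sample(frac=..., random_state=...) twice and compare the computed results. If you need a sample that does not depend on the partitioning, one option is to fix the number of partitions before sampling.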

Hey Guillaume.

Thanks for the answer. That’s exactly what I was looking for.

Milton