Distributed dask dataframe sample reproducibility

miltava · August 23, 2023, 12:00am

Hello, everyone.

What’s the best way to achieve pseudo-randomness reproducibility in the distributed setting?

How does a dask dataframe behave when random_state is set? Will it always give the same result for the same dataframe and random_state? Does each partition use its own “seed”? What happens when the number of partitions is different?

Thanks for your help

Milton

guillaumeeb · August 23, 2023, 3:11pm

Hi @miltava,

Could you be a bit more precise on what you are trying to achieve? Which DataFrame function are you referring to? Could you share some sample code?

guillaumeeb · August 25, 2023, 7:53pm

Sorry, I did not pay attention to the word "sample" in the title, so I guess you are talking about the DataFrame.sample function.

In this case, according to the code, if you have the same DataFrame, meaning the same data and the same partitioning (number and shape of partitions), then sample will always give the same result for a given random_state. The random_state is used to create a sub state for each partition (https://github.com/dask/dask/blob/main/dask/utils.py#L410).

If the number of partitions is different, then you probably won’t get the same result.
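
To make the per-partition sub state idea more concrete, here is a rough sketch in plain Python. It is not dask's actual code: sample_partitioned is a hypothetical helper, the way the sub-seeds are derived is an assumption made for illustration, and the rows are assumed to split evenly across partitions. It only shows why the result is stable for a fixed partitioning and why it can change when the partitioning changes.

```python
import random


def sample_partitioned(rows, npartitions, frac, seed):
    """Sample `frac` of each partition, using one sub-seed per partition."""
    # Split the rows into equally sized partitions (assumes an even split).
    size = len(rows) // npartitions
    partitions = [rows[i * size:(i + 1) * size] for i in range(npartitions)]

    # Derive one sub-seed per partition from the single user-provided seed.
    # (Illustrative derivation only, not how dask builds its sub states.)
    master = random.Random(seed)
    sub_seeds = [master.getrandbits(32) for _ in range(npartitions)]

    sampled = []
    for part, sub_seed in zip(partitions, sub_seeds):
        rng = random.Random(sub_seed)
        sampled.extend(rng.sample(part, round(frac * len(part))))
    return sampled


rows = list(range(100))

# Same data, same partitioning, same seed: identical samples.
first = sample_partitioned(rows, npartitions=4, frac=0.2, seed=42)
second = sample_partitioned(rows, npartitions=4, frac=0.2, seed=42)
print(first == second)  # True

# Same data and seed, but a different number of partitions: the rows are
# now split differently across the sub-seeded generators, so the sample
# will generally differ.
other = sample_partitioned(rows, npartitions=2, frac=0.2, seed=42)
print(first == other)
```

With a real dask DataFrame, the equivalent check would be to call sample(frac=..., random_state=...) twice on the same DataFrame and compare the computed results, then repartition and compare again.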

miltava · September 7, 2023, 8:41pm

Hey Guillaume.

Thanks for the answer. That’s exactly what I was looking for.

Milton
