Right, so for my use case, after sharding and shuffling I wanted to do a data-parallel operation (described in Dividing data among workers and downloading data local to a worker), i.e. I want to preprocess every row of the dataset.
I read more of the Dask documentation and was thinking I could use the `map_partitions` function to let Dask handle distributing the rows instead of dividing them up manually.
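Something like this is what I had in mind (a rough sketch; the input/output paths, the `text` column, and the `preprocess` logic are placeholders for my actual pipeline):

```python
import dask.dataframe as dd
import pandas as pd

def preprocess(partition: pd.DataFrame) -> pd.DataFrame:
    """Placeholder row-wise preprocessing applied to one pandas partition."""
    partition = partition.copy()
    partition["text"] = partition["text"].str.lower().str.strip()
    return partition

ddf = dd.read_parquet("shuffled/")    # placeholder path to the shuffled shards
out = ddf.map_partitions(preprocess)  # Dask calls preprocess on each partition in parallel
out.to_parquet("preprocessed/")       # placeholder output path
```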