Shuffle and shard dask dataframe

Right, so for my use case, after sharding and shuffling I wanted to do a data-parallel operation (described in Dividing data among workers and downloading data local to a worker), i.e. I want to preprocess every row of the dataset.

I read more of the Dask documentation and was thinking I could use the map_partitions function to let Dask handle the rows instead of dividing them up manually. A rough sketch of what I have in mind is below.
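Here is a minimal sketch of that idea, assuming the data lives in Parquet files and has a `text` column (both of those details are just placeholders for illustration):

```python
import dask.dataframe as dd
import pandas as pd

# Hypothetical per-partition preprocessing; the column name is an assumption.
def preprocess_partition(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is an ordinary pandas DataFrame, so any row-wise pandas code works here.
    pdf = pdf.copy()
    pdf["text"] = pdf["text"].str.lower()  # example row-level transformation
    return pdf

ddf = dd.read_parquet("data/*.parquet")  # assumed input path

# Dask calls preprocess_partition once per partition, in parallel across workers,
# so there is no need to divide the rows among workers by hand.
processed = ddf.map_partitions(preprocess_partition)

result = processed.compute()  # or .persist() to keep the result on the cluster
```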
