Dividing data among workers and downloading data local to a worker

I wanted to know if the workflow below would be possible using Dask.

I have a dataset with image paths and another, numeric feature (let's say 4 data points for simplicity), loaded as a Dask DataFrame:
s3://mydata/image1.jpg, 1
s3://mydata/image2.jpg, 2
s3://mydata/image3.jpg, 3
s3://mydata/image4.jpg, 4

Now, I divide this DataFrame into 2 partitions (and I have 2 Dask workers).

I want to send one partition to each worker. The images themselves are stored in a public S3 bucket (or any repository). I want to download each partition's images to the worker that holds it (image1, image2 to worker 0; image3, image4 to worker 1) and then run map_partitions/apply on each row of the partition, so that each worker's data lives on the machine the worker runs on.
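Roughly, I'm imagining something like the sketch below (the metadata file `s3://mydata/metadata.csv`, the column names `path` and `feature`, and the use of s3fs for the downloads are all assumptions on my part; the real per-image work is elided):

```python
import os

import dask.dataframe as dd


def process_partition(part):
    import s3fs  # imported inside the function so it resolves on the worker

    fs = s3fs.S3FileSystem(anon=True)  # public bucket, anonymous access
    local_paths = []
    for row in part.itertuples():
        # Download this row's image onto the machine this worker runs on
        local_path = os.path.join("/tmp", os.path.basename(row.path))
        fs.get(row.path, local_path)
        local_paths.append(local_path)
        # ... per-image work on local_path would go here ...
    return part.assign(local_path=local_paths)


# 4 rows split into 2 partitions -> one partition per worker
df = dd.read_csv("s3://mydata/metadata.csv").repartition(npartitions=2)
out = df.map_partitions(
    process_partition,
    # meta tells Dask the output schema up front; these dtypes are assumptions
    meta={"path": "object", "feature": "int64", "local_path": "object"},
)
print(out.compute())
```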

Is such a workflow possible using Dask?

Hi @vigneshn1997! Would you be able to provide a copy-pasteable minimal reproducer? It'd be helpful to know what you've tried and how it currently works, rather than us writing up a minimal example that only guesses at what you're doing. Additionally, this question sounds similar to other questions you've asked (e.g. Dask data sharding and Shuffle and shard dask dataframe); perhaps the replies shared there could help you here?

Hi @scharlottej13. Yes, I realized later that the replies to my previously asked questions will help me here. My apologies for the repeated question. Thank you for your help :smile:


Of course! And feel free to share what ends up working for you here too! I’m sure others would benefit :slight_smile:
