Dividing data among workers and downloading data local to a worker

I wanted to know if the workflow below would be possible using Dask.

I have a dataset with image paths and another (numeric) feature; for simplicity, say it has 4 data points. This is loaded as a Dask DataFrame:
s3://mydata/image1.jpg, 1
s3://mydata/image2.jpg, 2
s3://mydata/image3.jpg, 3
s3://mydata/image4.jpg, 4

Now, I divide this dataframe into 2 partitions (and I have 2 Dask workers).

I want to send one partition to each worker. The images themselves are stored in a public S3 bucket (or any repository). I want to download the images to the Dask workers (image1 and image2 to worker 0; image3 and image4 to worker 1) and then perform map_partitions/apply on each row of the partition. That way, each worker's data will be on the machine the worker is running on.

Is such a workflow possible using Dask?

Hi @vigneshn1997! Would you be able to provide a copy-pasteable minimal reproducer? It’d be helpful to know what you’ve tried and how it currently works, rather than us writing up a minimal example that only guesses at what you’re doing. Additionally, this question sounds similar to other questions you’ve asked (e.g. Dask data sharding and Shuffle and shard dask dataframe); perhaps the replies shared there could help you here?

Hi @scharlottej13. Yes, I realized later that the previously asked questions will help me here. My apologies for the repeated question. Thank you for your help :smile:


Of course! And feel free to share what ends up working for you here too! I’m sure others would benefit :slight_smile: