Map_partitions question for image processing

vigneshn1997 · February 14, 2022, 5:15pm

I am having trouble using map_partitions:

import numpy as np
import dask.dataframe as dd
from dask.distributed import Client
from dask.distributed import progress
from PIL import Image

client = Client()
DATA_URL = "metadata.csv"
feature_names = ["image_path", "label"]
# np.str was only an alias for the builtin str and has been removed from recent NumPy releases
dtypes = {'image_path': str, 'label': np.int16}
df0 = dd.read_csv(DATA_URL, names=feature_names, dtype=dtypes)
df = df0.sample(frac=0.001)
df.head()

new_df = df.repartition(npartitions=2)

def preprocess(path):
    im = Image.open(str(path))
    pixels = list(im.getdata())
    return sum(pixels)

# this works for me (apply is called on each image_path and returns the sum of pixels for that image)
# np.int was only an alias for the builtin int and has been removed from recent NumPy releases
sum_col = new_df.image_path.apply(lambda x: preprocess(str(x)), meta=int)
sum_col.head()

# but why doesn't map_partitions work?
sum_col = new_df.image_path.map_partitions(lambda x: preprocess(str(x)), meta=int)
sum_col.head()

Why is map_partitions taking the entire dataframe as a single string, while apply is taking each row separately?

scharlottej13 · February 14, 2022, 7:35pm

Hi @vigneshn1997, thanks for the question! I separated it into a new topic for clarity.

With apply, the lambda applies preprocess(str(x)) to each value in the new_df.image_path series (as you correctly expected). With map_partitions, the lambda is called once per partition, and its argument is that partition’s whole pandas series, so str(x) turns the entire series into a single string. To process rows individually, you need an additional function that applies preprocess to each value within the partition. Here’s a minimal example:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

# use the distributed client
client = Client()

# create simple dask dataframe
ddf = dd.from_pandas(
    pd.DataFrame({'image_path': ['x.jpg'] * 10, 'label': range(1,11)}),
    npartitions=2
)

# simplification of your function
def preprocess(path):
    return len(path)

# expected result, using apply
sum_col1 = ddf.image_path.apply(lambda x: preprocess(str(x)), meta=('sum_col1', int))
sum_col1.head()

# not what we want, sum_col2 is a 2-element series of '86'...
sum_col2 = ddf.image_path.map_partitions(lambda x: preprocess(str(x)), meta=('sum_col2', int))
sum_col2.compute()
# returns a 'string-ified' version of the image_path series, with a length of 86
len(ddf.image_path.map_partitions(lambda x: str(x)).partitions[0].compute()[0])

# expected result, using map_partitions
sum_col3 = ddf.image_path.map_partitions(
    lambda x: x.apply(lambda y: preprocess(str(y))),
    meta=('sum_col3', int)
)
sum_col3.head()
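
To see the difference without Dask involved, here is a minimal pandas-only sketch. It assumes that each partition handed to map_partitions is an ordinary pandas Series; the `partition` variable below is a hypothetical stand-in for one such partition.

```python
import pandas as pd

# Hypothetical stand-in for one partition of ddf.image_path
partition = pd.Series(["x.jpg"] * 5, name="image_path")

def preprocess(path):
    # same simplification as above: just measure the string
    return len(path)

# What the failing map_partitions lambda does: str() of the WHOLE series,
# so preprocess sees one long string (the printed form of the series)
whole = preprocess(str(partition))

# What apply does: preprocess is called once per value
per_row = partition.apply(lambda p: preprocess(str(p)))

print(whole)             # a single number for the entire partition
print(per_row.tolist())  # one number per row
```

This is why the working map_partitions version wraps the per-row call in x.apply(...): the outer function receives a series, and the inner one receives each value.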

It’s also worth noting that passing meta=int will not work in the future. If you’re using the latest version of Dask, you’ll notice:

FutureWarning: Meta is not valid, `map_partitions` expects output to be a pandas object. Try passing a pandas object as meta or a dict or tuple representing the (name, dtype) of the columns. In the future the meta you passed will not work.

vigneshn1997 · February 14, 2022, 8:56pm

Thank you very much @scharlottej13 for the clarification. I wanted to know if there is some way to control the map_partitions computation so that it runs on a particular worker. Can I use client.submit to submit the map_partitions call to a specific worker?

This is so that I don’t have to call .compute to materialize the computation on the scheduler.

scharlottej13 · February 14, 2022, 11:12pm

No problem @vigneshn1997! I’d be curious to hear more about why you’d like to avoid submitting work to the client. When a Client is instantiated, it automatically becomes the default for running Dask collections (e.g. map_partitions) and will distribute tasks to the available workers. The Dask documentation on managing computation, including asynchronous computation, may be what you’re looking for.

vigneshn1997 · February 15, 2022, 1:34am

Each worker will have a different partition of the images, and I want each worker to operate only on its own partition (if it tries to access another worker’s image, it will get a file-not-found error). I am trying to achieve a data-parallel setup this way.

scharlottej13 · February 15, 2022, 6:01pm

Have you already tried using the workers parameter of Client.compute?

vigneshn1997 · February 21, 2022, 7:54pm

Yes, I was able to assign tasks to workers using the workers parameter.
