Row processing: map_partitions vs apply

vigneshn1997 · March 26, 2022, 11:30pm

I am trying to process a table row-wise (applying the same pre-processing to each row). It works with apply, but when I try the same thing with map_partitions I get an error and can't work out why. Can't map_partitions process rows? Here is a minimal example:

import numpy as np
import pandas as pd
import dask.dataframe as dd

def inc10(x):
    return x + 10

def inc100(x):
    return x + 100

def process_row(row):
    proc_x = inc10(row['x'])
    proc_y = inc100(row['y'])
    return [proc_x, proc_y]

df = pd.DataFrame({'x': range(150), 'y': range(150)})
ddf = dd.from_pandas(df, npartitions=1)

# this works
proc_df = ddf.apply(lambda row: process_row(row), axis=1, meta=('proc_df', object))
proc_df.compute()

# this gives an error (KeyError: 'x')
proc_df_mp = ddf.map_partitions(lambda part : part.apply(lambda row: process_row(row)), meta=('proc_df_mp', object))
proc_df_mp.compute()

pavithraes · March 28, 2022, 11:57am

@vigneshn1997 In your map_partitions statement, you need to set axis=1:

proc_df_mp = ddf.map_partitions(lambda part : part.apply(lambda row: process_row(row), axis=1), meta=('proc_df_mp', object))

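Since map_partitions operates on the underlying pandas DataFrames, the apply inside your lambda is the pandas apply, which defaults to axis=0. With axis=0 the function receives each column rather than each row, so row['x'] looks up the label 'x' in the row index and raises KeyError: 'x'.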
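To see the axis behaviour in isolation, here is a small pandas-only sketch (no Dask involved). It assumes a three-row frame with the same column names as in the question and a condensed version of process_row; the names raised and result are just for illustration.

```python
import pandas as pd

def process_row(row):
    # Expects a Series whose labels are the column names 'x' and 'y'
    return [row["x"] + 10, row["y"] + 100]

df = pd.DataFrame({"x": range(3), "y": range(3)})

# Default axis=0: the function is called once per column, so `row` is really
# the column 'x' (labelled 0, 1, 2) and the lookup row["x"] fails.
try:
    df.apply(process_row)
    raised = False
except KeyError:
    raised = True

# axis=1: the function is called once per row, labelled by column name.
result = df.apply(process_row, axis=1)

print(raised)
print(result.iloc[0])
```

Inside map_partitions each partition is a plain pandas DataFrame like df above, so the same axis=1 argument is what makes the per-row function work there.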
