How to explode a dask dataframe column horizontally instead of vertically

I am trying to emulate the below behavior using Dask Dataframe to explode the columns horizontally instead of vertically

I am able to achieve it using the below code in Pandas dataframe

import pandas as pd

data = {‘list_column’: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]}

df = pd.DataFrame(data)

df_exploded = df.explode(‘list_column’)

df_horizontal = df_exploded.pivot_table(index=df_exploded.index, columns=df_exploded.groupby(df_exploded.index).cumcount(), values=‘list_column’)

df_horizontal

Hi @Damilola,

If you know before hand the number of int in the list, you could go with map_partitions:

import dask
import pandas as pd
import dask.dataframe as ddf
dask.config.set({"dataframe.convert-string": False})

data = {'list_column': [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]]}
df = pd.DataFrame(data)
dask_df = ddf.from_pandas(df, npartitions=3)

def explode_horizontal(df):
    df_exploded = df.explode('list_column')
    df_horizontal = df_exploded.pivot_table(index=df_exploded.index, columns=df_exploded.groupby(df_exploded.index).cumcount(), values='list_column')
    return df_horizontal

result = dask_df.map_partitions(explode_horizontal, meta={0:'int', 1: 'int', 2: 'int'})#meta is key here

result.compute()
1 Like

Hi @guillaumeeb

Thanks for the response.

It appears you are exploding the Pandas dataframe and not the Dask dataframe.

I also want a solution that will work if the length of the list values is arbitrary or not known in advance as that could impact how the meta parameter is constructed

Please note that all list entries in a column will have the same length. However, this length can vary and cannot always be deterministic in advance.

This is the current Pandas dataframe that works

Pandas DataFrame Approach

import pandas as pd

data = {‘list_column’: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]}

df = pd.DataFrame(data)

df_exploded = df.explode(‘list_column’)

df_horizontal = df_exploded.pivot_table(index=df_exploded.index, columns=df_exploded.groupby(df_exploded.index).cumcount(), values=‘list_column’)

df_horizontal

However, from the pure Dask Dataframe point of view, this is what I am actually looking for which I have not been able to get to work as I am always getting a ValueError: Grouper and axis must be same length

Dask DataFrame Approach

import pandas as pd
import dask.dataframe as dd

data = {‘list_column’: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]}

pdf = pd.DataFrame(data)

ddf = dd.from_pandas(pdf, npartitions=1)

ddf_exploded = ddf.explode(‘list_column’)

ddf_horizontal = ddf_exploded.pivot_table(index=ddf_exploded.index, columns=ddf_exploded.groupby(df_exploded.index).cumcount(), values=‘list_column’)

ddf_horizontal.compute()

Thank you so much for your help in this regard.

Yes I am, because I’m using the Pandas method on each partition of the Dask DataFrame, using map_partitions.

I understand, but I don’t think this is possible, you’ve got to specify meta kwarg to Dask. Could you just read a few lines of data to determine the length of the list values to build the meta kwarg?