Subsetting Dask DataFrame based on a column

I have a DataFrame with a column named idx that is in sorted order from 0 to 15,240,191 and I have a list subsamples = [1, 3, 5,..., 15240188] that correspond to idx values I want from the DataFrame to create a new DataFrame. I know that I cannot use iloc like I would with Pandas. Is there a reasonable workaround? Everything I’ve tried so far has resulted in loading the entire dataset into memory.

I apologize if I am missing basic documentation. I’m very new to Dask.

Hi @agdenadel, welcome to Dask discourse,

How do you load this Dataframe? Is idx the real dataframe index?

I think of two possibilities:

  • Make your subsamples a Dataframe and performe a join/merge with it on the input one. This should be efficient if your input dataframe is indexed properly.
  • If subsamples is not too big, just use map_partitions with it in argument, and perform a selection on the Pandas partitions.

What did you try so far?

Thanks you @guillaumeeb, this was very helpful. My solution ended up being

subsampled_df = pd.DataFrame(subsamples, columns = ['idx'])
subsampled_df = dd.from_pandas(subsampled_df, npartitions=1)
subsetted_df = my_df.merge(subsampled_df, on="idx")

it seems like this could be a practical replacement for iloc when wishing to select specific rows.

1 Like