Subsetting Dask DataFrame based on a column

agdenadel · March 27, 2024, 12:47am

I have a DataFrame with a column named idx that is sorted from 0 to 15,240,191, and a list subsamples = [1, 3, 5,..., 15240188] whose entries are the idx values I want to pull out of the DataFrame into a new DataFrame. I know that I cannot use iloc like I would with Pandas. Is there a reasonable workaround? Everything I’ve tried so far has resulted in loading the entire dataset into memory.

I apologize if I am missing basic documentation. I’m very new to Dask.

guillaumeeb · March 27, 2024, 3:47pm

Hi @agdenadel, welcome to Dask discourse,

How do you load this DataFrame? Is idx the actual DataFrame index?

I can think of two possibilities:

1. Make your subsamples a DataFrame and perform a join/merge between it and the input one. This should be efficient if your input DataFrame is indexed properly.
2. If subsamples is not too big, just use map_partitions, passing subsamples as an argument, and perform the selection on each Pandas partition.

What did you try so far?

agdenadel · March 28, 2024, 12:04am

Thank you @guillaumeeb, this was very helpful. My solution ended up being:

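import pandas as pd
import dask.dataframe as dd
# Wrap the wanted idx values in a one-partition Dask DataFrame, then inner-join on idx.
subsampled_df = pd.DataFrame(subsamples, columns=["idx"])
subsampled_df = dd.from_pandas(subsampled_df, npartitions=1)
subsetted_df = my_df.merge(subsampled_df, on="idx")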

It seems like this could be a practical replacement for iloc when you want to select specific rows.
