I have a DataFrame with a column named idx
that is in sorted order from 0 to 15,240,191 and I have a list subsamples = [1, 3, 5,..., 15240188]
that correspond to idx
values I want from the DataFrame to create a new DataFrame. I know that I cannot use iloc
like I would with Pandas. Is there a reasonable workaround? Everything I’ve tried so far has resulted in loading the entire dataset into memory.
I apologize if I am missing basic documentation. I’m very new to Dask.
Hi @agdenadel, welcome to Dask discourse,
How do you load this Dataframe? Is idx
the real dataframe index?
I think of two possibilities:
- Make your
subsamples
a Dataframe and performe a join/merge with it on the input one. This should be efficient if your input dataframe is indexed properly.
- If
subsamples
is not too big, just use map_partitions with it in argument, and perform a selection on the Pandas partitions.
What did you try so far?
Thanks you @guillaumeeb, this was very helpful. My solution ended up being
subsampled_df = pd.DataFrame(subsamples, columns = ['idx'])
subsampled_df = dd.from_pandas(subsampled_df, npartitions=1)
subsetted_df = my_df.merge(subsampled_df, on="idx")
it seems like this could be a practical replacement for iloc
when wishing to select specific rows.
1 Like