I have a Dask DataFrame with a column named idx, sorted from 0 to 15,240,191, and a list subsamples = [1, 3, 5,..., 15240188] of the idx values I want to select into a new DataFrame. I know that I cannot use iloc like I would with Pandas. Is there a reasonable workaround? Everything I’ve tried so far has resulted in loading the entire dataset into memory.
I apologize if I am missing basic documentation. I’m very new to Dask.
Hi @agdenadel, welcome to Dask discourse,
How do you load this Dataframe? Is idx the real dataframe index?
I can think of two possibilities:
- Make your subsamples a DataFrame and perform a join/merge between it and the input one. This should be efficient if your input DataFrame is indexed properly.
- If subsamples is not too big, just pass it as an argument to map_partitions and perform the selection on the Pandas partitions.
What did you try so far?
Thank you @guillaumeeb, this was very helpful. My solution ended up being:
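import pandas as pd
import dask.dataframe as dd

# Wrap the wanted idx values in a one-partition Dask DataFrame, then inner-join on idx.
subsampled_df = pd.DataFrame(subsamples, columns=["idx"])
subsampled_df = dd.from_pandas(subsampled_df, npartitions=1)
subsetted_df = my_df.merge(subsampled_df, on="idx")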
It seems like this could be a practical replacement for iloc when you want to select specific rows.