How to append to a Dask Dataframe

As noted above, if your input DataFrame doesn't fit into memory, there isn't much you can do, except calling persist before the for loop if the DataFrame does fit in the distributed cluster's memory. This would avoid reading the data back from the source for every drop_duplicates call!

The thing is, drop_duplicates already runs in parallel, so it would be a bit difficult and dangerous to also try to run the for loop in parallel.