The result of drop_duplicates with p2p shuffle differs from the result with tasks shuffle

I have a huge list of string IDs stored as CSV files.

After reading the CSV files into a Dask DataFrame, I ran the two calls below and got different results. With the p2p shuffle, drop_duplicates removed some duplicated IDs entirely instead of keeping the first or last record of each group of duplicates.

The result of drop_duplicates with the tasks shuffle looks correct. How and why did this happen, and how can I debug it?

n_rows = ddf.astype({"nodes": "string[pyarrow]"}).drop_duplicates(subset=["nodes"], shuffle_method="p2p").shape[0].compute()
print(n_rows)

n_rows = ddf.astype({"nodes": "string[pyarrow]"}).drop_duplicates(subset=["nodes"], shuffle_method="tasks").shape[0].compute()
print(n_rows)
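
To make the expectation concrete, here is a minimal sketch in plain pandas on made-up IDs (not my real data, and not the Dask code path itself). It shows the behaviour I would expect from either shuffle method: one row survives per distinct ID, so the row count after drop_duplicates equals the number of distinct IDs and no ID goes missing.

```python
import pandas as pd

# Made-up IDs for illustration only; the real IDs come from the CSV files.
pdf = pd.DataFrame({"nodes": ["a", "b", "a", "c", "b", "a"]})

# Reference answer: one row per distinct ID, keeping the first occurrence.
deduped = pdf.drop_duplicates(subset=["nodes"], keep="first")

# The row count after deduplication should equal the number of distinct IDs.
n_rows = deduped.shape[0]
n_unique = pdf["nodes"].nunique()

# IDs present in the input but absent from the deduplicated result.
# This set should be empty; a non-empty set would mean whole IDs were dropped.
missing = set(pdf["nodes"]) - set(deduped["nodes"])

print(n_rows, n_unique, missing)
```

On the Dask side, a similar check might be to compare each n_rows against ddf["nodes"].nunique().compute(), on the assumption that this reduction is not affected by the shuffle method.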