I have a huge list of string IDs stored in CSV files.
After reading them from CSV into a Dask DataFrame, I ran the two calls below, and they return different results. With the p2p shuffle, drop_duplicates removes some duplicated IDs entirely instead of keeping the first or last record of each group of duplicates.
The result with the tasks shuffle looks correct. How and why does this happen, and how can I debug it?
n_rows = ddf.astype({"nodes": "string[pyarrow]"}).drop_duplicates(subset=["nodes"], shuffle_method="p2p").shape[0].compute()
print(n_rows)
n_rows = ddf.astype({"nodes": "string[pyarrow]"}).drop_duplicates(subset=["nodes"], shuffle_method="tasks").shape[0].compute()
print(n_rows)
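
To narrow this down, one thing I could try is comparing the IDs each method returns against the full set of unique IDs, to see exactly which ones go missing and how often they occur in the input. Below is a minimal sketch of that comparison in plain Python. The helper name `compare_dedup_results` and the toy data are hypothetical, and it assumes the IDs (or a sample of them) fit in memory after `.compute()`.

```python
from collections import Counter


def compare_dedup_results(original_ids, p2p_ids, tasks_ids):
    """Report which unique IDs are missing from each deduplicated result.

    Hypothetical helper for debugging; all inputs are in-memory iterables.
    """
    original_ids = list(original_ids)
    expected = set(original_ids)       # ground truth: every distinct ID
    counts = Counter(original_ids)     # how often each ID occurs in the input
    missing_from_p2p = sorted(expected - set(p2p_ids))
    missing_from_tasks = sorted(expected - set(tasks_ids))
    return {
        "expected_unique": len(expected),
        "missing_from_p2p": missing_from_p2p,
        "missing_from_tasks": missing_from_tasks,
        # Input frequency of each ID that the p2p result lost
        "occurrences_of_missing": {k: counts[k] for k in missing_from_p2p},
    }


# Toy data standing in for the real columns. With Dask, the inputs could be
# obtained along these lines (assumption: the data fits in memory):
#   original = ddf["nodes"].compute()
#   p2p = ddf.drop_duplicates(subset=["nodes"], shuffle_method="p2p")["nodes"].compute()
#   tasks = ddf.drop_duplicates(subset=["nodes"], shuffle_method="tasks")["nodes"].compute()
original = ["a", "b", "a", "c", "c", "d"]
p2p = ["a", "b", "d"]          # pretend "c" was dropped completely
tasks = ["a", "b", "c", "d"]

report = compare_dedup_results(original, p2p, tasks)
print(report)
```

If the missing IDs all turn out to be ones that occur more than once in the input, that would at least confirm the loss happens during deduplication and not while reading the CSV files.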