Hey Team, we are trying to deduplicate data in Dask using drop_duplicates. It worked fine when we were deduplicating one source at a time, but it started throwing worker connection timeout errors once we began running the dedup for multiple sources in parallel. Can someone suggest what the root cause might be?
We also looked at the metrics and don't see any major memory or CPU spikes, so we don't know what is causing these connection failures during client creation. Any leads would be very helpful.
Hi @shivammmmm,
Could you share a reproducer of your issue?