Hello! I have tried to find an answer elsewhere, but I have not found this issue reported before.
I am trying to load a moderately sized CSV file (~5.8 GB) into a Dask DataFrame with dask.dataframe.read_csv. The lazy evaluation goes fine, but then, after requesting the object with .compute(), I see an endless loop of repartitiontofewer tasks and it doesn't finish loading.
I feel that I am probably doing something wrong, as it should be a fairly simple task. I am processing this on a jobqueue.SLURMCluster with 5 workers, each with 29 GB of RAM, so it should be OK. Any comments on how I can debug/fix this?
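For reference, this is roughly what I'm running; the file path, queue name, and worker resources below are placeholders rather than my exact setup:

```python
# Minimal sketch of my workflow; paths, queue name, and resource values
# are placeholders, not my real configuration.
import dask.dataframe as dd
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="normal",    # placeholder partition name
    cores=4,           # cores per worker job
    memory="29GB",     # memory per worker job
)
cluster.scale(jobs=5)  # 5 workers
client = Client(cluster)

ddf = dd.read_csv("data/dump.csv")  # lazy evaluation, this part goes fine
df = ddf.compute()                  # here the repartitiontofewer tasks start looping
```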
I have “solved” my issue by replacing the CSV with a Parquet file (both of them were dumps from an Xarray dataset, with 105 million rows and 9 columns). But I don’t understand why I was having issues with the CSV load. Perhaps the chunking? I tested values below 100 MB, as recommended, but that didn’t improve the constant rebalancing issue. See the sketch below for what I tried.
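Roughly what I tried and what ended up working; the paths and the blocksize value here are illustrative, not my exact ones:

```python
# Illustrative sketch: smaller CSV partitions vs. the Parquet workaround.
import dask.dataframe as dd

# Smaller partitions did not stop the constant repartitioning:
ddf_csv = dd.read_csv("data/dump.csv", blocksize="64MB")

# The workaround that did work: dump the Xarray dataset to Parquet instead
# and read that (assuming `ds` is the original xarray.Dataset):
#   ds.to_dataframe().to_parquet("data/dump.parquet")
ddf = dd.read_parquet("data/dump.parquet")
df = ddf.compute()
```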
Although it’s understandable that using Parquet will be faster, and it is clearly recommended, I don’t see a good reason why it would fail with a CSV file; 6 GB is not that big. Do you have a single file or multiple files?
Yes, it puzzled me too. It is a single file, dumped from an Xarray Dataset. I wonder if it could be the data types in the CSV getting mismatched, although I would expect that to cause an error while loading, not a rebalancing issue.
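If it does turn out to be dtypes, I suppose I could force them explicitly, something like the following (column names and types are made up here, since I'd need to check the real schema):

```python
# Hypothetical sketch: force column dtypes at read time instead of
# relying on inference. Column names/types are placeholders.
import dask.dataframe as dd

ddf = dd.read_csv(
    "data/dump.csv",
    dtype={
        "time": "object",
        "lat": "float64",
        "lon": "float64",
        "value": "float64",
    },
)
print(ddf.dtypes)  # compare the forced dtypes with what Dask would infer
```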