Reading a GB-sized CSV and immediately entering an endless repartitioning loop

Hello! I have tried to find this answer somewhere else, but have not found this issue previously reported.

I am trying to load a moderately sized CSV file (~5.8 GB) into a Dask DataFrame with dask.dataframe.read_csv. The lazy evaluation goes fine, but after materializing the object with .compute(), I see an endless loop of repartition-to-fewer tasks, and it doesn't finish loading.

I feel that I am probably doing something wrong, as this should be a fairly simple task. I am processing this on a jobqueue.SLURMCluster with 5 workers, each with 29 GB of RAM, so memory should not be an issue. Any suggestions on how I can debug or fix this?

Hi @rlourenco, welcome to Dask Discourse!

Could you provide a code snippet of what you are doing? Do you perform any operation on the Dask DataFrame before calling compute()?

Hi @guillaumeeb ! Nice to see you here too :slight_smile:

It's a fairly simple setup: an adaptive SLURMCluster with 5-10 nodes, as follows:

from dask_jobqueue import SLURMCluster

node_share = 8  # split each 64-core, 249 GB node between 8 workers
cluster = SLURMCluster(
    processes=1,  # Do not increase processes!
    cores=int(64 / node_share),
    memory=str(int(249 / node_share)) + 'GB',
    interface='ib0',
    death_timeout=(60 * 5),
    local_directory='$SLURM_TMPDIR',  # '/home/lourenco/scratch/dask_journaling_cache',
    log_directory='/home/lourenco/scratch/dask_slurm_logs',
    worker_extra_args=['--lifetime', '55m',
                       '--lifetime-stagger', '4m',
                       '--lifetime-restart',
                       ],
    # job_extra_directives=['--exclusive',
    #         ],
    account="def-ggalex",
    walltime='01:01:00',
)
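
The adaptive scaling and client connection are roughly as below; I'm reconstructing this part from memory, so the exact adapt bounds are only indicative of the 5-10 node range I mentioned:

from dask.distributed import Client

# Let the cluster scale between 5 and 10 workers depending on load
cluster.adapt(minimum=5, maximum=10)
client = Client(cluster)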

And then running dask.dataframe.read_csv. That's all.
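
Concretely, the read is just something like this (the path is a placeholder for my actual file):

import dask.dataframe as dd

# Lazily point Dask at the ~5.8 GB CSV, then materialize it
df = dd.read_csv('/path/to/dump.csv')  # placeholder path
result = df.compute()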

I have "solved" my issue by replacing the CSV with a Parquet file (both of them were dumps from an Xarray Dataset, with 105 million rows and 9 columns). But I don't understand why I was having issues with the CSV load. Perhaps the chunking? I tested blocksize values below 100 MB, as recommended, but that didn't improve the constant rebalancing issue.
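
For reference, the workaround looked roughly like this; the paths and the exact blocksize value are placeholders:

import dask.dataframe as dd

# What I tried first: forcing smaller CSV blocks (values below 100 MB)
df = dd.read_csv('/path/to/dump.csv', blocksize='64MB')  # placeholder path/size

# The workaround that actually behaved: dump to Parquet and read that instead
df = dd.read_parquet('/path/to/dump.parquet')  # placeholder path
result = df.compute()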

I was reading this blog post by Coiled, and perhaps it is related to the difference in data structures when loading CSV vs. Parquet:

My pleasure!

Although it's understandable that Parquet will be faster, and it is clearly recommended, I don't see a good reason why it would fail with a CSV file; 6 GB is not that big. Do you have a single file or multiple files?

Yes, it puzzled me too. It is a single file, dumped from an Xarray Dataset. I wonder if it could be the data types in the CSV getting mismatched, although I would expect that to cause an error while loading, not a rebalancing issue.
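
If it is a dtype issue, I suppose explicitly passing the dtypes to read_csv would rule it out; the column names and types below are placeholders, not my real schema:

import dask.dataframe as dd

# Hypothetical schema: declare dtypes up front so read_csv does not have to
# infer them separately for each block (these names/types are placeholders)
dtypes = {
    'time': 'object',
    'lat': 'float64',
    'lon': 'float64',
    'value': 'float64',
}
df = dd.read_csv('/path/to/dump.csv', dtype=dtypes)  # placeholder path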