Reading a GB-sized CSV and immediately entering an endless repartitioning loop

Hello! I have tried to find this answer somewhere else, but have not found this issue previously reported.

I am trying to load a moderately sized CSV file (~5.8 GB) into a Dask DataFrame with dask.dataframe.read_csv. The lazy evaluation goes fine, but after materializing the object with .compute(), I see an endless loop of repartition-to-fewer tasks, and it doesn't finish loading.

I feel that I am probably doing something wrong, as this should be a fairly simple task. I am processing this on a jobqueue.SLURMCluster with 5 workers, each with 29 GB of RAM, so memory should not be an issue. Any suggestions on how I can debug or fix this?

Hi @rlourenco, welcome to Dask Discourse!

Could you provide a code snippet of what you are doing? Do you perform any operation on the Dask DataFrame before calling compute()?

Hi @guillaumeeb ! Nice to see you here too :slight_smile:

It's a fairly simple setup: an adaptive SLURMCluster with 5-10 nodes, as follows:

from dask_jobqueue import SLURMCluster

node_share = 8  # split each 64-core, 249 GB node between 8 workers
cluster = SLURMCluster(
    processes=1,  # Do not increase processes!
    cores=int(64 / node_share),
    memory=str(int(249 / node_share)) + 'GB',
    interface='ib0',
    death_timeout=(60 * 5),
    local_directory='$SLURM_TMPDIR',  # '/home/lourenco/scratch/dask_journaling_cache',
    log_directory='/home/lourenco/scratch/dask_slurm_logs',
    worker_extra_args=['--lifetime', '55m',
                       '--lifetime-stagger', '4m',
                       '--lifetime-restart',
                       ],
    # job_extra_directives=['--exclusive',
    #         ],
    account="def-ggalex",
    walltime='01:01:00',
)
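
The adaptive scaling and client connection are roughly as below; I'm reconstructing this part from memory, so the exact adapt bounds are only indicative of the 5-10 node range I mentioned:

from dask.distributed import Client

# Let the cluster scale between 5 and 10 workers depending on load
cluster.adapt(minimum=5, maximum=10)
client = Client(cluster)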

And then running dask.dataframe.read_csv. That's all.
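
Concretely, the read is just something like this (the path is a placeholder for my actual file):

import dask.dataframe as dd

# Lazily point Dask at the ~5.8 GB CSV, then materialize it
df = dd.read_csv('/path/to/dump.csv')  # placeholder path
result = df.compute()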

I have "solved" my issue by replacing the CSV with a Parquet file (both of them were dumps from an Xarray Dataset, with 105 million rows and 9 columns). But I don't understand why I was having issues with the CSV load. Perhaps the chunking? I tested blocksize values below 100 MB, as recommended, but that didn't improve the constant rebalancing issue.
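
For reference, the workaround looked roughly like this; the paths and the exact blocksize value are placeholders:

import dask.dataframe as dd

# What I tried first: forcing smaller CSV blocks (values below 100 MB)
df = dd.read_csv('/path/to/dump.csv', blocksize='64MB')  # placeholder path/size

# The workaround that actually behaved: dump to Parquet and read that instead
df = dd.read_parquet('/path/to/dump.parquet')  # placeholder path
result = df.compute()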

I was reading this blog post by Coiled, and perhaps it is related to the difference in data structures when loading CSV vs. Parquet:

My pleasure!

Although it's understandable that Parquet will be faster, and it is clearly recommended, I don't see a good reason why it would fail with a CSV file; 6 GB is not that big. Do you have a single file or multiple files?

Yes, it puzzled me too. It is a single file, dumped from an Xarray Dataset. I wonder if it could be the data types in the CSV getting mismatched, although I would expect that to cause an error while loading, not a rebalancing issue.
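
If it is a dtype issue, I suppose explicitly passing the dtypes to read_csv would rule it out; the column names and types below are placeholders, not my real schema:

import dask.dataframe as dd

# Hypothetical schema: declare dtypes up front so read_csv does not have to
# infer them separately for each block (these names/types are placeholders)
dtypes = {
    'time': 'object',
    'lat': 'float64',
    'lon': 'float64',
    'value': 'float64',
}
df = dd.read_csv('/path/to/dump.csv', dtype=dtypes)  # placeholder path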