to_sql() does not work for larger-than-memory files on a Dask cluster inside Docker

I am trying to create a SQL table from a large (4 GiB) CSV file using the following Dask code, on a Dask cluster with 2 workers (1 GiB memory each), to simulate out-of-memory computation.

        import dask.dataframe as dd

        ddata = dd.read_csv(file_path, blocksize=raw_blocksize).set_index('Unnamed: 0')
        ddata.to_sql(table_name, PostgresDBConnection().postgres_url, parallel=True, if_exists='replace')

This code works fine for files up to 2 GiB, but for larger files the warning below appears continuously in the console logs, and after writing about 50% of the data to SQL the process hangs.

dask_worker_1        | distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 728.89 MiB -- Worker memory limit: 1.00 GiB

dask_worker_2        | distributed.worker - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker.html#memtrim for more information. -- Unmanaged memory: 783.55 MiB -- Worker memory limit: 1.00 GiB

This is the status on the Dask dashboard when the SQL-writing process hangs, with the continuous WARNING messages shown above.

Ideally, I was expecting to see the grey horizontal bars (indicating memory spilled to disk) while the process continues to execute. Am I missing something?

I was hoping this GitHub issue would help me find a fix, but no luck.

Hi @SOUMYASHUKLA, thanks for this question! You might find this post on tackling unmanaged memory helpful. You’re already doing the first step mentioned (tracking with the dashboard), but there are additional suggestions such as garbage collection and memory trimming that may help.
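
In case it's useful, here's a minimal sketch of those manual garbage collection and memory trimming calls (assuming Linux-based worker images; the scheduler address is just a placeholder for your own setup):

    import ctypes
    import gc

    from dask.distributed import Client

    def trim_memory() -> int:
        # Ask glibc to return freed memory to the OS (Linux/glibc only)
        libc = ctypes.CDLL("libc.so.6")
        return libc.malloc_trim(0)

    client = Client("tcp://scheduler:8786")  # placeholder scheduler address
    client.run(gc.collect)    # force a garbage-collection pass on every worker
    client.run(trim_memory)   # then trim the freed memory back to the OS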

Do you have the same issue when saving the file to disk, or is this specific to loading to SQL, as the title of your post suggests?

Hello @scharlottej13, thank you for your response.
I have already tried the memory-trimming methods mentioned in this post; however, the memory problems still persist. This issue is specific to loading data into SQL.

@SOUMYASHUKLA I’m again wondering if increasing the number of partitions in ddata will help?
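
For example (a rough sketch reusing the ddata from your snippet; the 64MB target is just an illustrative value):

    # Smaller partitions mean less memory is needed per task.
    ddata = ddata.repartition(partition_size="64MB")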

@pavithraes Thank you for your response. I did try reducing the blocksize to increase the number of partitions. However, to get things working, I ended up looping through each partition to save the data into postgres. :slight_smile:
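
Roughly, the workaround looks like this (a simplified sketch rather than my exact code, reusing the names from my first snippet):

    import dask.dataframe as dd
    import sqlalchemy

    engine = sqlalchemy.create_engine(PostgresDBConnection().postgres_url)

    ddata = dd.read_csv(file_path, blocksize=raw_blocksize).set_index('Unnamed: 0')

    for i in range(ddata.npartitions):
        # Materialize one partition at a time so only one chunk sits in memory
        chunk = ddata.partitions[i].compute()
        chunk.to_sql(table_name, engine, if_exists='replace' if i == 0 else 'append')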

@SOUMYASHUKLA Thanks for sharing, I’m happy you found a work-around! I still feel Dask’s to_sql() should be able to handle this directly, so I’ll continue thinking about this.

One idea that comes to mind: to_sql triggers computation if you don't call compute() or persist() before it, which sometimes leads to misdiagnosing which operation is actually causing the error. So, I might try calling .persist() before to_sql and see if it helps.
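
With the snippet from your first post, that would look roughly like this (just a sketch):

    # Persist first so the read_csv/set_index work is separated from the
    # to_sql write, making it clearer which step is running out of memory.
    ddata = ddata.persist()
    ddata.to_sql(table_name, PostgresDBConnection().postgres_url,
                 parallel=True, if_exists='replace')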

And, to confirm, when you say ‘looping through each partition’, do you mean using a for-loop, or using map_partitions? I’d suggest checking out map_partitions if you’re not using it already!
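
If it helps, here's a rough sketch of the map_partitions approach (the write_partition helper and its arguments are hypothetical, and it assumes the target table already exists so concurrent partitions can simply append):

    import pandas as pd
    import sqlalchemy

    def write_partition(df, table_name, url):
        # Each worker writes its own pandas partition with pandas' to_sql.
        # The engine is created inside the function so it isn't serialized.
        engine = sqlalchemy.create_engine(url)
        df.to_sql(table_name, engine, if_exists='append', index=True)
        engine.dispose()
        return pd.DataFrame({'rows_written': [len(df)]})

    # meta describes the small summary frame each partition returns
    counts = ddata.map_partitions(
        write_partition,
        table_name,
        PostgresDBConnection().postgres_url,
        meta={'rows_written': 'int64'},
    ).compute()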
