Loading a large dataset from Postgres using the minimum amount of memory

Hello! We are trying to load a large dataset from a PostgreSQL database using psycopg2 and pandas. The problem is the high memory usage. Is it possible to use Dask to reduce the memory usage of the request?

Hi @luis.casas, welcome to Dask community!

It’s a little hard to tell without knowing your workflow.
Dask can help if you don’t need the complete dataset in memory at a given time. It can read the data chunk by chunk, freeing memory once a chunk has been processed, but it depends on what you want to do with this dataset.

Say you just want to do some simple ETL: read the data, process it chunk by chunk, and write it back in Parquet file format. Then Dask will let you do this by streaming the input data, keeping memory usage at a lower level.
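To illustrate the streaming pattern, here is a minimal sketch using pandas’ `chunksize` with an in-memory SQLite table standing in for your Postgres table (so the example is self-contained; against Postgres you would pass a psycopg2 or SQLAlchemy connection instead, and `dask.dataframe.read_sql_table` works along the same lines). The table and column names are made up for the example:

```python
import sqlite3

import pandas as pd

# A small in-memory SQLite table stands in for the large Postgres table;
# with Postgres you would pass a psycopg2 / SQLAlchemy connection instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO measurements (value) VALUES (?)",
    [(float(i),) for i in range(10_000)],
)

# chunksize turns read_sql into an iterator of DataFrames: only one chunk
# lives in memory at a time, so peak usage is bounded by the chunk size.
processed_rows = 0
for chunk in pd.read_sql("SELECT id, value FROM measurements", conn, chunksize=1_000):
    chunk["value"] = chunk["value"] * 2  # per-chunk transform step
    # the per-chunk write (e.g. appending to Parquet/CSV) would go here
    processed_rows += len(chunk)

print(processed_rows)  # 10000
```

Each chunk is transformed and can be written out before the next one is fetched, so the full result set never sits in memory at once.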

Thanks for the quick reply!
My question is more about whether it is possible to perform operations on a large dataset without loading it into memory entirely. We can’t use chunks because the operation needs to be done in one go.
Yes, it would be something like the ETL example, but the step of reading the data completely exhausts the container’s memory. Any ideas?

Well, it depends on your workflow. Could you provide a reproducible example, or a code snippet?

Dask is first and foremost about performing operations on a dataset without loading it into memory entirely, but it assumes the operation can be done chunk by chunk, in a map/reduce fashion.
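To make the map/reduce point concrete: many operations that appear to need the whole dataset at once, such as a global mean, decompose into per-chunk partial results plus a final combine. A minimal sketch in plain Python of this idea (the helper names here are made up for illustration; Dask applies the same decomposition across its partitions):

```python
from typing import Iterable, Iterator, List


def chunks(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield fixed-size chunks, so at most `size` values are held at once."""
    chunk: List[float] = []
    for v in values:
        chunk.append(v)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def chunked_mean(values: Iterable[float], size: int = 1_000) -> float:
    # Map: reduce each chunk to a partial (sum, count).
    # Reduce: combine the partials; the full dataset is never materialized.
    total, count = 0.0, 0
    for chunk in chunks(values, size):
        total += sum(chunk)
        count += len(chunk)
    return total / count


print(chunked_mean(float(i) for i in range(1_000_000)))  # 499999.5
```

If your operation genuinely cannot be decomposed this way (some algorithms need random access to all rows at once), then chunked or Dask-based streaming won’t help, and that is the key question to answer about your workload.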