Processing all rows of a large database table: a good use case?

Hi,

I have a use case where I want to take all the rows of a large Postgres table, do a bit of computation on each row separately, and write something back to the database. I am new to Dask, but this looks like a good fit to me, since the job will require a lot of memory and is massively parallel.

However, I wanted to confirm that this is indeed something Dask would be good for. And if so, should I do everything with Dask DataFrames, or should I first load the data “manually” and use Dask only for the map operation?

Thanks

Hi @rodriguealcazar, welcome to Dask community!

Well yes, especially if processing each row is somewhat compute intensive and requires memory, Dask sounds like a great fit.

I would advise using Dask DataFrames if possible, loading the data with read_sql. This is usually the most straightforward approach. But be careful about how you split the partitions if each row needs a lot of memory.
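To make this concrete, here is a minimal sketch of that workflow. It assumes a hypothetical connection URI, a table named `my_table` with an indexed numeric `id` column and a `value` column, and a results table `my_table_results`; swap in your own names and computation.

```python
import dask.dataframe as dd

# Hypothetical connection string -- replace with your own.
uri = "postgresql://user:password@localhost:5432/mydb"

# Load the table as a Dask DataFrame. index_col must be an indexed, sortable
# column so Dask can split the table into partitions on the database side.
# Smaller bytes_per_chunk means smaller partitions and lower memory per task.
ddf = dd.read_sql_table(
    "my_table",
    uri,
    index_col="id",
    bytes_per_chunk="128 MiB",
)

def process(df):
    # Plain pandas code, applied to each partition independently.
    df = df.copy()
    df["result"] = df["value"] * 2  # placeholder for the real per-row computation
    return df

result = ddf.map_partitions(process)

# Write the results back to Postgres, one partition at a time.
result.to_sql("my_table_results", uri, if_exists="replace", index=True)
```

If each row's computation is memory hungry, tune `bytes_per_chunk` (or pass `npartitions`/`divisions`) so that a single partition plus its intermediate results fits comfortably in a worker's memory.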