Bad performance when training a model from SQL data using a Dask cluster

Hello everyone

I came across a problem when trying to train a model on data from a PostgreSQL table using Dask's read_sql_table function.
I have a 12-million-row dataset and an sklearn pipeline that looks like this:

from dask_ml.impute import SimpleImputer
from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.ensemble import RandomForestRegressor

imputer = SimpleImputer(strategy="most_frequent")
categorizer = Categorizer()
encoder = DummyEncoder()
regressor = RandomForestRegressor(n_estimators=3, max_depth=5)
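
These steps are chained into a scikit-learn Pipeline and fit on a Dask DataFrame, roughly like this (step names and X, y are just placeholders, not my exact code):

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("imputer", imputer),
    ("categorizer", categorizer),
    ("encoder", encoder),
    ("regressor", regressor),
])
# X, y stand for the feature/target Dask DataFrames built from the data
pipeline.fit(X, y)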

Everything except the model is from the dask-ml library.
When I trained this pipeline from a CSV file with a Dask LocalCluster (4 workers), it worked completely fine, but when I tried to fit the same model with exactly the same data from the SQL table, the Dask workers failed with out-of-memory errors. I tried different npartitions sizes but got the same result.
In the Dask dashboard I noticed that the workers usually fail while performing the repartition task.
I was able to fit the model with just 30,000 rows from my table, and even that took way longer than training on the full 12-million-row dataset from CSV (even when the CSV file is loaded from S3 storage rather than the local file system). Is there any explanation for this behavior? Why are training performance and memory usage so much worse with data from SQL, or am I doing something wrong?

Hi @Glebbot, welcome to the Dask community!

It will be a bit hard to help without more context. Could you give us the code you are using when reading from CSV, and the one using read_sql?

Anyway, a few thoughts:

Why are you doing a repartition? Are you also doing it when reading from CSV?

Did you try some performance test on your database, without using Dask at all (a quick check like the sketch below)? Reading from cloud storage can be much faster than multiple reads from a single database. Where is your database located?

Training performance can be linked to IO performance. Memory usage shouldn't be, though, if things are correctly partitioned.
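
For that database check, a quick test outside of Dask already gives a baseline, for example (connection URI and table name are placeholders):

import time
import pandas as pd
from sqlalchemy import create_engine

# Placeholder URI and table: time a plain pandas read of a sample
# to get a rough idea of the database read throughput
engine = create_engine("postgresql://user:password@host:5432/mydb")
start = time.time()
sample = pd.read_sql("SELECT * FROM my_table LIMIT 100000", engine)
print(f"Read {len(sample)} rows in {time.time() - start:.1f} s")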

Hi @guillaumeeb
Thank you for your answer, I found the cause of the problem. It turns out that read_sql_table doesn't convert types (unlike read_csv), so all my columns were read as strings. As a result, int and float columns were categorized and one-hot encoded, and I ended up with a lot of extra columns in my dataframe. I solved the problem by manually casting the DataFrame types.
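
For anyone hitting the same issue, the workaround looks roughly like this (connection URI, table, and column names/dtypes are placeholders for my actual ones):

import dask.dataframe as dd

# Placeholder URI, table and index column
ddf = dd.read_sql_table(
    "my_table",
    "postgresql://user:password@host:5432/mydb",
    index_col="id",
)

# In my case the numeric columns came back as strings, so cast them
# explicitly before running the pipeline
ddf = ddf.astype({"num_col_a": "float64", "num_col_b": "int64"})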
