Bad performance when training a model from SQL data using a Dask cluster

Hello everyone

I came across a problem when trying to train a model on data from a PostgreSQL table using Dask's read_sql_table function.
I have a 12-million-row dataset and an sklearn pipeline that looks like this:

from dask_ml.impute import SimpleImputer
from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.ensemble import RandomForestRegressor

imputer = SimpleImputer(strategy="most_frequent")
categorizer = Categorizer()
encoder = DummyEncoder()
regressor = RandomForestRegressor(n_estimators=3, max_depth=5)
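
These steps are chained into a scikit-learn Pipeline and fit on a Dask DataFrame, roughly like this (step names and X, y are just placeholders, not my exact code):

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("imputer", imputer),
    ("categorizer", categorizer),
    ("encoder", encoder),
    ("regressor", regressor),
])
# X, y stand for the feature/target Dask DataFrames built from the data
pipeline.fit(X, y)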

Everything except the model is from the dask-ml library.
When I trained this pipeline from a CSV file with a Dask LocalCluster (4 workers), it worked completely fine, but when I tried to fit the same model with exactly the same data from the SQL table, the Dask workers failed with out-of-memory errors. I tried different npartitions sizes but got the same result.
In the Dask dashboard I noticed that the workers usually fail while performing the repartition task.
I was able to fit the model with just 30,000 rows from my table, and even that took way longer than training on the full 12-million-row dataset from CSV (even when the CSV file is loaded from S3 storage rather than the local file system). Is there any explanation for this behavior? Why are training performance and memory usage so much worse with data from SQL, or am I doing something wrong?

Hi @Glebbot, welcome to the Dask community!

It will be a bit hard to help without more context. Could you give us the code you are using when reading from CSV, and the one using read_sql?

Anyway, a few thoughts:

Why are you doing a repartition? Are you also doing it when reading from CSV?

Did you try some performance test on your database, without using Dask at all (a quick check like the sketch below)? Reading from cloud storage can be much faster than multiple reads from a single database. Where is your database located?

Training performance can be linked to IO performance. Memory usage shouldn't be, though, if things are correctly partitioned.
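
For that database check, a quick test outside of Dask already gives a baseline, for example (connection URI and table name are placeholders):

import time
import pandas as pd
from sqlalchemy import create_engine

# Placeholder URI and table: time a plain pandas read of a sample
# to get a rough idea of the database read throughput
engine = create_engine("postgresql://user:password@host:5432/mydb")
start = time.time()
sample = pd.read_sql("SELECT * FROM my_table LIMIT 100000", engine)
print(f"Read {len(sample)} rows in {time.time() - start:.1f} s")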

Hi @guillaumeeb
Thank you for your answer, I found the cause of the problem. It turns out that read_sql_table doesn't convert types (unlike read_csv), so all my columns were read as strings. As a result, int and float columns were categorized and one-hot encoded, and I ended up with a lot of extra columns in my dataframe. I solved the problem by manually casting the DataFrame types.
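
For anyone hitting the same issue, the workaround looks roughly like this (connection URI, table, and column names/dtypes are placeholders for my actual ones):

import dask.dataframe as dd

# Placeholder URI, table and index column
ddf = dd.read_sql_table(
    "my_table",
    "postgresql://user:password@host:5432/mydb",
    index_col="id",
)

# In my case the numeric columns came back as strings, so cast them
# explicitly before running the pipeline
ddf = ddf.astype({"num_col_a": "float64", "num_col_b": "int64"})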
