Hello everyone
I ran into a problem while trying to train a model on data from a PostgreSQL table loaded with Dask's read_sql_table function.
I have a dataset of 12 million rows and an sklearn pipeline that looks like this:
from dask_ml.impute import SimpleImputer
from dask_ml.preprocessing import Categorizer, DummyEncoder
from sklearn.ensemble import RandomForestRegressor

imputer = SimpleImputer(strategy="most_frequent")
categorizer = Categorizer()
encoder = DummyEncoder()
regressor = RandomForestRegressor(n_estimators=3, max_depth=5)
Everything except the model is from the dask-ml library.
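The steps are chained in a plain sklearn Pipeline, roughly like this (just a sketch; X and y below stand for my feature and target dask dataframes):

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("imputer", imputer),
    ("categorizer", categorizer),
    ("encoder", encoder),
    ("regressor", regressor),
])
pipeline.fit(X, y)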
When I trained this pipeline from a CSV file on a Dask LocalCluster (4 workers) everything worked completely fine, but when I tried to fit the same model on exactly the same data read from the SQL table, the Dask workers failed with out-of-memory errors. I tried different npartitions values but got the same result.
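The SQL load itself looks roughly like this (connection string, table name and column names are placeholders):

import dask.dataframe as dd

ddf = dd.read_sql_table(
    "my_table",
    "postgresql://user:password@host:5432/mydb",
    index_col="id",      # an indexed integer column, used by dask to split the table into partitions
    npartitions=100,     # I also experimented with letting dask size chunks via bytes_per_chunk
)
X = ddf.drop(columns=["target"])
y = ddf["target"]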
In the Dask dashboard I noticed that the workers usually fail while performing the repartition task.
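Is an explicit repartition to a fixed partition size right after loading the right way to keep the workers within memory, or is the problem somewhere else? Something like this (the "100MB" value is arbitrary):

ddf = ddf.repartition(partition_size="100MB")  # force small, uniform partitions before fitting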
I was able to fit the model with just 30,000 rows from my table, and even that training took way longer than training on the full 12-million-row dataset from CSV (even when the CSV file is loaded from S3 storage rather than the local file system). Is there any explanation for this behavior? Why are training performance and memory usage so much worse with data read from SQL, or am I doing something wrong?
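For reference, the CSV-based run that works fine is essentially this (the bucket path and column name are placeholders):

import dask.dataframe as dd

# same 12 million rows exported to CSV; this trains without any memory problems
ddf_csv = dd.read_csv("s3://my-bucket/training-data-*.csv")
X = ddf_csv.drop(columns=["target"])
y = ddf_csv["target"]
pipeline.fit(X, y)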