Errors training XGBoost with Parquet files on a single node

I have a folder with 8 Parquet files, each between 1,200 and 1,500 MB. I'm trying to train an XGBoost model using Dask, but I'm running into these errors on a single node with 4 CPUs and 64 GB of memory:

04/12/2023 11:39:36 PM:INFO:Event loop was unresponsive in Worker for 3.08s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

which leads to:
distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:43265 (pid=307) exceeded 95% memory budget. Restarting…

asyncio.exceptions.TimeoutError
2023-04-12 23:42:02 UTC – 04/12/2023 11:42:02 PM:INFO:Worker process 542 was killed by signal 9

My code is below:

from dask.dataframe import read_parquet
folder = '/code/data'
df = read_parquet(folder)  # lazily read all Parquet files in the folder

from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1)  # local cluster: 4 single-threaded workers
import dask
dask.config.set({"distributed.comm.timeouts.connect": "60s", "distributed.comm.timeouts.tcp": "60s"})

from dask_ml.model_selection import train_test_split
y = df['label']
X = df.drop(columns=['label'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

from xgboost.dask import DaskXGBClassifier
model = DaskXGBClassifier(n_estimators=100, tree_method="hist")
model.client = client  # assign the client
model.fit(X_train, y_train)

How can I fix this?

Thanks

Hi @skunkwerk, welcome to the Dask Discourse forum!

First of all, these are big files! By default, Dask will create one partition per file. Can you confirm, using the Dashboard or by printing the DataFrame info, that you've got 8 partitions? Since Parquet is compressed, a 1.5 GiB file can expand into several gigabytes of data per chunk in memory. With each of your workers having only 16 GiB, that can fill up quickly.

You might want to look at the split_row_groups kwarg of read_parquet to solve this.
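For example, something like this (just a sketch; 'adaptive' lets Dask size partitions from the Parquet row-group metadata, and split_row_groups also accepts an integer number of row groups per output partition):

from dask.dataframe import read_parquet

# Split each Parquet file into several smaller partitions based on its
# row groups, instead of one ~1.5 GB partition per file.
df = read_parquet('/code/data', split_row_groups='adaptive')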

Then, if this doesn't fix the problem, I would first try some simple manipulation of the dataset to see if it succeeds, like computing a mean or other statistics. If this works, then you might want to tune the XGBoost part.

Thanks. I switched to split_row_groups='adaptive' and was able to do a simple computation on the DataFrame like:

df['label'].mean().compute()

which worked fine. However, training an XGBoost model still fails with the same error as above.

Okay, so you are able to read and do some computations on the DataFrame, which is good.

From there, I would go step by step:

  • Can you generate your input X and y arrays and save them to Zarr?
  • Can you train a simpler model, say a linear model, or a less complex XGBoost model? (See the sketch of both checks just below.)
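A rough sketch of both checks, reusing the client, X_train and y_train from your snippet (the Zarr paths and the smaller hyperparameters are just placeholders, and the Zarr part assumes all features are numeric):

import dask.array as da
from xgboost.dask import DaskXGBClassifier

# 1. Materialise the training inputs as Dask arrays and persist them to Zarr.
#    lengths=True computes each partition's length; rechunking gives the
#    regular chunks that to_zarr expects.
X_arr = X_train.to_dask_array(lengths=True).rechunk((100_000, -1))
y_arr = y_train.to_dask_array(lengths=True).rechunk(100_000)
da.to_zarr(X_arr, '/code/data/X_train.zarr')
da.to_zarr(y_arr, '/code/data/y_train.zarr')

# 2. Train a deliberately small XGBoost model to see whether model complexity
#    is what exhausts the workers' memory.
small_model = DaskXGBClassifier(n_estimators=10, max_depth=4, tree_method="hist")
small_model.client = client
small_model.fit(X_train, y_train)

If the small model trains but the full one does not, the problem is more likely memory pressure during boosting than the Parquet reading itself.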