Errors training XGBoost with Parquet files on a single node

I have a folder with 8 Parquet files, each between 1,200 and 1,500 MB. I'm trying to train an XGBoost model using Dask, but I'm running into these errors on a single node with 4 CPUs and 64 GB of memory:

04/12/2023 11:39:36 PM:INFO:Event loop was unresponsive in Worker for 3.08s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

which leads to:
distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:43265 (pid=307) exceeded 95% memory budget. Restarting…

asyncio.exceptions.TimeoutError
2023-04-12 23:42:02 UTC – 04/12/2023 11:42:02 PM:INFO:Worker process 542 was killed by signal 9

My code is below:

from dask.dataframe import read_parquet
folder = '/code/data'
df = read_parquet(folder)  # lazily read all Parquet files in the folder

from dask.distributed import Client
client = Client(n_workers=4, threads_per_worker=1)  # local cluster: 4 single-threaded workers
import dask
dask.config.set({"distributed.comm.timeouts.connect": "60s", "distributed.comm.timeouts.tcp": "60s"})

from dask_ml.model_selection import train_test_split
y = df['label']
X = df.drop(columns=['label'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

from xgboost.dask import DaskXGBClassifier
model = DaskXGBClassifier(n_estimators=100, tree_method="hist")
model.client = client  # assign the client
model.fit(X_train, y_train)

How can I fix this?

Thanks

Hi @skunkwerk, welcome to the Dask Discourse forum!

First of all, these are big files! By default, Dask will create one partition per file. Can you confirm, using the Dashboard or by printing the DataFrame info, that you've got 8 partitions? Since Parquet is compressed, a 1.5 GiB file can expand into several gigabytes of data per chunk in memory. With each of your workers having only 16 GiB, that can fill up quickly.

You might want to look at the split_row_groups kwarg of read_parquet to solve this.
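For example, something like this (just a sketch; 'adaptive' lets Dask size partitions from the Parquet row-group metadata, and split_row_groups also accepts an integer number of row groups per output partition):

from dask.dataframe import read_parquet

# Split each Parquet file into several smaller partitions based on its
# row groups, instead of one ~1.5 GB partition per file.
df = read_parquet('/code/data', split_row_groups='adaptive')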

Then, if this doesn't fix the problem, I would first try some simple manipulation of the dataset to see if it succeeds, like computing a mean or other statistics. If this works, then you might want to tune the XGBoost part.

Thanks. I switched to split_row_groups='adaptive' and was able to do a simple computation on the DataFrame like:

df['label'].mean().compute()

which worked fine. However, training an XGBoost model still fails with the same error as above.

Okay, so you are able to read and do some computations on the DataFrame, which is good.

From there, I would go step by step:

  • Can you generate your input X and y arrays and save them to Zarr?
  • Can you train a simpler model, say a linear model, or a less complex XGBoost model? (See the sketch of both checks just below.)
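A rough sketch of both checks, reusing the client, X_train and y_train from your snippet (the Zarr paths and the smaller hyperparameters are just placeholders, and the Zarr part assumes all features are numeric):

import dask.array as da
from xgboost.dask import DaskXGBClassifier

# 1. Materialise the training inputs as Dask arrays and persist them to Zarr.
#    lengths=True computes each partition's length; rechunking gives the
#    regular chunks that to_zarr expects.
X_arr = X_train.to_dask_array(lengths=True).rechunk((100_000, -1))
y_arr = y_train.to_dask_array(lengths=True).rechunk(100_000)
da.to_zarr(X_arr, '/code/data/X_train.zarr')
da.to_zarr(y_arr, '/code/data/y_train.zarr')

# 2. Train a deliberately small XGBoost model to see whether model complexity
#    is what exhausts the workers' memory.
small_model = DaskXGBClassifier(n_estimators=10, max_depth=4, tree_method="hist")
small_model.client = client
small_model.fit(X_train, y_train)

If the small model trains but the full one does not, the problem is more likely memory pressure during boosting than the Parquet reading itself.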