I have a folder with 8 Parquet files, each between 1,200 and 1,500 MB. I'm trying to train an XGBoost model with Dask on a single node with 4 CPUs and 64 GB of memory, but I keep running into these errors:
04/12/2023 11:39:36 PM:INFO:Event loop was unresponsive in Worker for 3.08s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
which leads to:
distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:43265 (pid=307) exceeded 95% memory budget. Restarting…
asyncio.exceptions.TimeoutError
2023-04-12 23:42:02 UTC – 04/12/2023 11:42:02 PM:INFO:Worker process 542 was killed by signal 9
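In case partition sizing is the issue, here is a minimal sketch of how I can inspect what read_parquet produces (memory_usage_per_partition assumes a recent-enough dask, and computing it does load the partitions):

import dask.dataframe as dd

df = dd.read_parquet('/code/data')
print(df.npartitions)  # how many partitions dask builds from the 8 files
print(df.memory_usage_per_partition(deep=True).compute())  # in-memory bytes per partition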
My training code is below:
import dask
from dask.dataframe import read_parquet
from dask.distributed import Client
from dask_ml.model_selection import train_test_split
from xgboost.dask import DaskXGBClassifier

# Raise the comm timeouts before the cluster starts so they actually apply
dask.config.set({"distributed.comm.timeouts.connect": "60s", "distributed.comm.timeouts.tcp": "60s"})

# Local cluster: 4 single-threaded workers on the one machine
client = Client(n_workers=4, threads_per_worker=1)

# Lazily read every Parquet file in the folder
folder = '/code/data'
df = read_parquet(folder)

# Separate the label column and hold out 20% for testing
y = df['label']
X = df.drop(columns=['label'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Train distributed XGBoost on the Dask collections
model = DaskXGBClassifier(n_estimators=100, tree_method="hist")
model.client = client  # assign the client
model.fit(X_train, y_train)
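For reference, a small sketch of how I can check how the 64 GB gets split across the 4 workers (I believe client.scheduler_info() exposes each worker's memory_limit, but that is an assumption about the installed distributed version):

for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info["memory_limit"])  # memory budget per worker, in bytes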
How can I fix this so the workers stop exceeding their memory budget and getting killed?
Thanks