First time user: Coiled + RandomizedSearchCV hanging indefinitely, any way to get progress?

import coiled
cluster = coiled.Cluster(name="ml", n_workers=16)
client = cluster.get_client()
print('Dashboard:', client.dashboard_link)

from dask.distributed import Client, get_client
from dask_ml.model_selection import HyperbandSearchCV, GridSearchCV, RandomizedSearchCV
import dask.dataframe as dd
import dask.array as da
import xgboost as xgb
import pandas as pd
import numpy as np
from dask.diagnostics import ProgressBar  # note: this only tracks the local scheduler, not a distributed cluster

model = xgb.XGBClassifier()

# Define parameter space
params = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [4, 6, 8],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.8, 1.0],
    'n_estimators': [1000, 1500, 2000]
}

search = RandomizedSearchCV(model, params, n_iter=1)
dask_x = dd.from_pandas(df, npartitions=30).persist()  # df: local pandas DataFrame, 100k rows x 3000 float features
dask_y = dd.from_pandas(y, npartitions=30).persist()   # y: local pandas Series of labels
search.fit(dask_x, dask_y)
print("Best score: ", search.best_score_)
print("Best parameters: ", search.best_params_)

Getting: UserWarning: Sending large graph of size 12.96 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures

Even with n_iter=1 it is not finishing in any reasonable amount of time: fitting one parameter set takes under 5 minutes locally, and I have now waited 30 minutes for a single iteration. The dashboard task stream is empty, there is no CPU utilization on any worker, and the Coiled logs look like this: [screenshot of Coiled logs omitted]

This task (hyperparameter optimization) seems like an extremely common, hello-world type of activity for Dask. Are there examples of it? The dataset “df” consists entirely of floats and is 100k rows by 3000 features.

Is there also a way to get more of a “progress report” during execution? Right now I have no estimate of how long this will take, no results on the fly, nothing.

Hi @jmy48, welcome to Dask community!

I’m no expert in XGBoost, so I’m not sure my answer will be relevant. I tried to make a small reproducer out of your example and ended up with the code below:

import coiled
cluster = coiled.Cluster(name="ml", n_workers=16)
client = cluster.get_client()
print('Dashboard:', client.dashboard_link)

from dask_ml.model_selection import RandomizedSearchCV
import dask.array as da
import xgboost as xgb

model = xgb.XGBClassifier()

# Define parameter space
params = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [4, 6, 8],
    'n_estimators': [100, 150, 200]
}

search = RandomizedSearchCV(model, params, n_iter=2)
num_obs = 10_000  # use ints: float shapes/chunks can trip up da.random
num_features = 20
X = da.random.random(size=(num_obs, num_features), chunks=(num_obs // 10, num_features))
y = da.random.choice(10, num_obs, chunks=(num_obs // 10,))

search.fit(X, y)
print("Best score: ", search.best_score_)
print("Best parameters: ", search.best_params_)

With this, I see normal behavior on the Coiled cluster. I really simplified the model, and probably the input data too, in order to keep timings small and make testing easy. I can see tasks appearing and finishing on the Dashboard; everything looks fine.

At first I used completely random arrays even for y, and ended up with a model fit that never finished on my laptop.

The graph size is not that big, but it may still be worth understanding why you get this message.
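For what it’s worth, here is a minimal sketch of what that warning suggests: scatter the data to the cluster once, then pass around the resulting (tiny) futures instead of embedding the dataframe in every task graph. fit_one is a hypothetical helper, not a dask-ml API, and df/y stand in for your local pandas objects:

# Minimal sketch, assuming `client`, `df` and `y` from your snippet.
# Ship the data to the workers once; only small Future references
# travel inside subsequent task graphs.
X_future = client.scatter(df, broadcast=True)  # one copy on every worker
y_future = client.scatter(y, broadcast=True)

def fit_one(params, X, y):
    # Hypothetical helper: fit a single XGBoost model on one worker.
    import xgboost as xgb
    model = xgb.XGBClassifier(**params)
    model.fit(X, y)
    return model.score(X, y)

# Futures passed as arguments are resolved to their data on the worker.
future = client.submit(fit_one, {"max_depth": 4, "n_estimators": 100}, X_future, y_future)
print(future.result())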

This is really strange: you should at least see tasks appearing on the progress chart, even if they take ages to finish, along with some worker CPU usage.

Well, there are plenty of charts on the Dashboard, and other tooling outside of it (see Diagnostics (distributed) — Dask documentation), but most of them rely on task completion. You could probably also use as_completed to get results on the fly.
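As a hedged sketch of that last idea, reusing the hypothetical fit_one helper and the scattered X_future/y_future from above: submit one future per parameter set, then consume results as they finish. Also note that the dask.diagnostics.ProgressBar in your imports only tracks the local scheduler; on a cluster the equivalent is dask.distributed.progress.

from dask.distributed import as_completed, progress
from sklearn.model_selection import ParameterSampler

# Draw a few parameter sets from the `params` dict; one future per candidate.
param_list = list(ParameterSampler(params, n_iter=5, random_state=0))
futures = [client.submit(fit_one, p, X_future, y_future) for p in param_list]

# progress(futures)  # alternative: one progress bar over the whole batch

# Results arrive as soon as each fit finishes.
for future, score in as_completed(futures, with_results=True):
    print("finished one parameter set, score:", score)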

I assume the large graph size warning comes from using dd.from_pandas, which takes your local pandas dataframe and ships it to the cluster.
If you have a slow or unstable network connection, this would also explain why the cluster doesn’t appear to do much: computation only kicks off once the whole graph plus data has been submitted, and there is no progress report for this step (typically it takes a second or a few seconds).
This should not happen if you use a remote storage location (e.g. S3) and an API like read_parquet/read_csv/etc.
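For example, a minimal sketch assuming the data were written as Parquet to a hypothetical S3 bucket (reading from S3 also requires s3fs to be installed):

import dask.dataframe as dd

# Hypothetical bucket, path and label column; each worker reads its own
# partitions directly from S3, so nothing large leaves your machine.
ddf = dd.read_parquet("s3://my-bucket/training-data/")
X = ddf.drop(columns="label")
y = ddf["label"]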