import coiled

# Spin up a 16-worker Coiled cluster and connect a client to it
cluster = coiled.Cluster(name="ml", n_workers=16)
client = cluster.get_client()
print("Dashboard:", client.dashboard_link)
from dask.distributed import Client
from dask_ml.model_selection import RandomizedSearchCV
import dask.dataframe as dd
import xgboost as xgb
import pandas as pd
import numpy as np
model = xgb.XGBClassifier()
# Define parameter space (3 * 3 * 3 * 2 * 3 = 162 combinations)
params = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [4, 6, 8],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.8, 1.0],
    'n_estimators': [1000, 1500, 2000]
}
# Sample a single parameter set just to test the pipeline end to end
search = RandomizedSearchCV(model, params, n_iter=1)

# df and y are local pandas objects; partition them and persist on the cluster
dask_x = dd.from_pandas(df, npartitions=30).persist()
dask_y = dd.from_pandas(y, npartitions=30).persist()

search.fit(dask_x, dask_y)
print("Best score:", search.best_score_)
print("Best parameters:", search.best_params_)
When fit runs, I get:

UserWarning: Sending large graph of size 12.96 MiB. This may cause some slowdown. Consider scattering data ahead of time and using futures
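If I understand the warning, it wants me to scatter the data once and pass around futures instead of shipping the data inside every task graph. A sketch of what I think that means (using a local in-process Client as a stand-in for my Coiled cluster; I'm not sure this pattern even applies to dask_ml's fit):

```python
from dask.distributed import Client

client = Client(processes=False)  # local stand-in for the Coiled cluster

# Scatter a large object to the workers once; later tasks reference the
# resulting Future instead of re-shipping the data in every task graph
data = list(range(1_000_000))
data_future = client.scatter(data)

# Futures passed as arguments are resolved to the scattered data on the worker
def head(x, n):
    return x[:n]

result = client.submit(head, data_future, 3).result()
print(result)  # [0, 1, 2]

client.close()
```

Is this what the warning is asking for, and if so, how does it fit with RandomizedSearchCV, which takes the collections directly?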
Even with n_iter=1 it does not finish in any reasonable amount of time: fitting one parameter set locally takes under 5 minutes, but I have been waiting over 30 minutes for a single iteration on the cluster. The dashboard task stream is empty and there is no CPU utilization on any worker. The Coiled logs look like:
Hyperparameter optimization seems like an extremely common, hello-world-type task for Dask. Are there worked examples of it? The dataset “df” consists entirely of floats and is 100k rows by 3000 features.
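For a self-contained reproduction, data of the same kind can be simulated like this (column names and the target here are placeholders, not my real schema; the dimensions are scaled down so the snippet runs quickly):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Real data is 100_000 rows x 3_000 float features; scaled down here --
# bump these back up to reproduce the actual size (~1.2 GB as float32)
n_rows, n_features = 1_000, 50
df = pd.DataFrame(
    rng.standard_normal((n_rows, n_features), dtype=np.float32),
    columns=[f"f{i}" for i in range(n_features)],
)
y = pd.Series(rng.integers(0, 2, size=n_rows), name="target")

print(df.shape, y.shape)  # (1000, 50) (1000,)
```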
Is there also a way to get more of a “progress report” during execution? Right now I have no estimate of how long the search will take, no intermediate results, nothing.
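For context, this is the kind of live feedback I mean: dask.distributed.progress gives it for plain futures (sketch below, again on a local in-process Client), but I don't know how to get anything similar out of RandomizedSearchCV.fit:

```python
from dask.distributed import Client, progress

client = Client(processes=False)

def square(x):
    return x * x

# Submit a batch of tasks and watch them complete
futures = client.map(square, range(100))
progress(futures)  # renders a live progress bar while tasks run

results = client.gather(futures)
print(sum(results))  # 328350

client.close()
```

Does the search expose its internal futures anywhere so something like this could be attached?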