Hi Dask forum,
I can’t find clear best practices for what feels like a basic question: how can dask or dask-ml be used to parallelize model fitting over many different model frames? Most guidance seems oriented toward parameter tuning and meta-estimators.
Somewhat related questions:
- Meta-Estimators with Multiple Models
- Issue in Parallel row preprocessing with Dask
- Perform the same operation on all columns of a dask dataframe in parallel
- Clarification on Distributed Dask ML (Is ML really Distributed?)
As a minimal example, let’s say I spin up a client on a SLURM-managed HPC and want to fit a million different linear models using e.g. sklearn.linear_model.LinearRegression (or an sklearn model with an n_jobs argument that uses joblib to parallelize, or another module that similarly relies on BLAS/LAPACK for its linear algebra, etc.):
import dask
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
import numpy as np
from sklearn.linear_model import LinearRegression
cluster = SLURMCluster(<args>)
cluster.scale(4)
client = Client(cluster)
client.wait_for_workers(4)
rng = np.random.default_rng()
N, P = 100, 10  # observations per design, features
designs = [rng.normal(size=(N, P)) for _ in range(int(1e6))]
responses = [rng.normal(size=N) for _ in range(int(1e6))]  # one response vector of length N per design
# naive implementation: one delayed task per fit, computed all at once
fits = []
for X, y in zip(designs, responses):
    fits.append(dask.delayed(LinearRegression().fit)(X, y))
res = dask.compute(*fits)
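I suspect a million tiny tasks like this would strain the scheduler, so perhaps the fits should be batched. A sketch of what I have in mind (fit_batch and the chunk size of 1,000 are just illustrative choices, not anything I've benchmarked):

def fit_batch(batch):
    # batch is a list of (X, y) pairs; return one fitted model per pair
    return [LinearRegression().fit(X, y) for X, y in batch]

pairs = list(zip(designs, responses))
chunk = 1_000  # arbitrary batch size
batches = [pairs[i:i + chunk] for i in range(0, len(pairs), chunk)]
futures = client.map(fit_batch, batches)
results = [m for batch in client.gather(futures) for m in batch]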
What’s the best way to distribute these fits on a cluster?
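For instance, since sklearn already parallelizes via joblib, would routing the loop through joblib's dask backend be the intended pattern? A sketch of what I mean (this assumes the Client registers the "dask" backend, which I believe recent dask.distributed versions do automatically):

import joblib

# dispatch each per-model fit to the dask workers via joblib
with joblib.parallel_backend("dask"):
    res = joblib.Parallel()(
        joblib.delayed(LinearRegression().fit)(X, y)
        for X, y in zip(designs, responses)
    )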