I may have an interesting use case for dask_ml, but it's unclear to me when dask_ml is actually faster than scikit-learn. I ran the benchmark below, inspired by this source.
```python
from timeit import default_timer as tic

import pandas as pd
import seaborn as sns
import sklearn.linear_model
import dask_ml.linear_model
from dask_ml.datasets import make_classification

Ns = [1_000, 2_500, 5_000, 10_000, 25_000, 50_000, 100_000, 200_000, 500_000]
timings = []
for n in Ns:
    # dask arrays, split into blocks of 1,000 rows
    X, y = make_classification(n_samples=n, n_features=1000, random_state=n, chunks=1000)

    t1 = tic()
    dask_ml.linear_model.LogisticRegression().fit(X, y)
    timings.append(('dask-ml', n, tic() - t1))
    print(f"did dask with {n} items")

    # materialize to NumPy so scikit-learn is not also charged for executing the dask graph
    X, y = X.compute(), y.compute()
    t1 = tic()
    sklearn.linear_model.LogisticRegression().fit(X, y)
    timings.append(('Scikit-Learn', n, tic() - t1))
    print(f"did sklearn with {n} items")

df = pd.DataFrame(timings, columns=['method', 'Number of Samples', 'Fit Time'])
sns.factorplot(x='Number of Samples', y='Fit Time', hue='method',
               data=df, aspect=1.5)  # factorplot was renamed catplot in seaborn >= 0.9
```
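For context on the data layout: `chunks=1000` splits each dask array into blocks of 1,000 rows, so the largest run (500,000 samples) is scheduled as 500 separate blocks. A quick arithmetic check (plain Python, no dask needed):

```python
# Block counts implied by chunks=1000 in the benchmark above
Ns = [1_000, 2_500, 5_000, 10_000, 25_000, 50_000, 100_000, 200_000, 500_000]
chunk_rows = 1000
for n in Ns:
    blocks = -(-n // chunk_rows)  # ceiling division: number of row-blocks
    print(f"{n:>7} samples -> {blocks:>3} row-blocks")
```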
The resulting chart suggests that dask-ml is always slower to train, and the gap grows to an order of magnitude as the dataset gets larger. Am I missing something?
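One caveat about my own setup: the script times a single fit per size, and single-run timings can be noisy. A minimal sketch of a more robust harness using the standard library's `timeit.repeat`, where `fit_once` is a hypothetical stand-in for `model.fit(X, y)`:

```python
from timeit import repeat

def fit_once():
    # hypothetical stand-in for model.fit(X, y); any zero-argument callable works
    sum(i * i for i in range(100_000))

# Run the callable 5 times and keep the best; the minimum is less sensitive
# to background noise than the mean.
best = min(repeat(fit_once, number=1, repeat=5))
print(f"best of 5 runs: {best:.4f}s")
```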