What speedups can I expect from training with Dask-ML?

I may have an interesting use-case for dask_ml, but it’s a bit unclear when dask_ml is actually faster than scikit-learn. I ran the benchmark below, inspired by this source.

from dask_ml.datasets import make_classification
import pandas as pd

from timeit import default_timer as tic
import sklearn.linear_model
import dask_ml.linear_model
import seaborn as sns

Ns = [1_000, 2_500, 5_000, 10_000, 25_000, 50_000, 100_000, 200_000, 500_000]

timings = []

for n in Ns:
    # Dask arrays, chunked into blocks of 1000 rows.
    X, y = make_classification(n_samples=n, n_features=1000, random_state=n, chunks=1000)
    # In-memory NumPy copies for scikit-learn.
    X_np, y_np = X.compute(), y.compute()

    t1 = tic()
    sklearn.linear_model.LogisticRegression().fit(X_np, y_np)
    timings.append(('Scikit-Learn', n, tic() - t1))
    print(f"did scikit-learn with {n} samples")

    t1 = tic()
    dask_ml.linear_model.LogisticRegression().fit(X, y)
    timings.append(('dask-ml', n, tic() - t1))
    print(f"did dask-ml with {n} samples")


df = pd.DataFrame(timings, columns=['method', 'Number of Samples', 'Fit Time'])
sns.factorplot(x='Number of Samples', y='Fit Time', hue='method',
               data=df, aspect=1.5)

The resulting chart suggests that dask-ml is always slower to train, and by an order of magnitude once the dataset gets larger. Am I missing something?

Part of me thinks that I may need to pay attention to the chunks parameter in make_classification, but it raises an error when I set it to a lower number.

make_classification(n_samples=n, n_features=1000, random_state=n, chunks=20)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-30-5c1887f5dfec> in <module>
----> 1 make_classification(n_samples=n, n_features=1000, random_state=n, chunks=20)

~/Development/scikit-lego/venv/lib/python3.7/site-packages/dask_ml/datasets.py in make_classification(n_samples, n_features, n_informative, n_redundant, n_repeated, n_classes, n_clusters_per_class, weights, flip_y, class_sep, hypercube, shift, scale, shuffle, random_state, chunks)
    360 ):
    361     chunks = da.core.normalize_chunks(chunks, (n_samples, n_features))
--> 362     _check_axis_partitioning(chunks, n_features)
    363 
    364     if n_classes != 2:

~/Development/scikit-lego/venv/lib/python3.7/site-packages/dask_ml/datasets.py in _check_axis_partitioning(chunks, n_features)
     21             "\tn_features: {}".format(c, n_features)
     22         )
---> 23         raise ValueError(msg)
     24 
     25 

ValueError: Can only generate arrays partitioned along the first axis. Specifying a larger chunksize for the second axis.

	chunk size: 20
	n_features: 1000

It’s also worth mentioning that I do measure a significant boost when using ParallelPostFit from dask_ml.

import numpy as np

from timeit import default_timer as tic

import pandas as pd
import seaborn as sns
import sklearn.datasets
from sklearn.svm import SVC

import dask_ml.datasets
from dask_ml.wrappers import ParallelPostFit

X, y = sklearn.datasets.make_classification(n_samples=1000)
clf = ParallelPostFit(SVC(gamma='scale'))
clf.fit(X, y)


Ns = list(np.linspace(200, 10000, 15).astype(int))
timings = []


for n in Ns:
    X, y = dask_ml.datasets.make_classification(n_samples=n, random_state=42, chunks=n // 10)
    X_np, y_np = X.compute(), y.compute()
    t1 = tic()
    clf.estimator.predict(X_np)
    timings.append(('Scikit-Learn', n, tic() - t1))

    t1 = tic()
    clf.predict(X).compute()
    timings.append(('dask-ml', n, tic() - t1))


df = pd.DataFrame(timings,
                  columns=['method', 'Number of Samples', 'Predict Time'])
ax = sns.factorplot(x='Number of Samples', y='Predict Time', hue='method',
                    data=df, aspect=1.5)

There are some good explanations about when to use dask-ml on this page:
https://ml.dask.org/#dimensions-of-scale

With LogisticRegression, I’d say that’s only when the input data is bigger than memory. Otherwise the cost of parallelizing isn’t worth it. It also depends on the computing resources you’re running on: how many cores and how much memory do you have?
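To sketch what that looks like (the sizes are just illustrative, and this is not a benchmark): the point is to keep X and y as chunked Dask arrays and never call .compute() on them before fitting.

import dask_ml.datasets
import dask_ml.linear_model

# Chunked Dask arrays: only a few blocks of rows need to be in memory at a time.
# In a real larger-than-memory setting, n_samples would be much bigger.
X, y = dask_ml.datasets.make_classification(
    n_samples=100_000, n_features=1000, chunks=(10_000, 1000)
)

# Fit directly on the Dask arrays, without calling .compute() on them.
clf = dask_ml.linear_model.LogisticRegression()
clf.fit(X, y)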

To be fair, I have no idea how dask-ml distributes LogisticRegression. I reproduced your results (using only one core), so maybe this algorithm just isn’t well optimized on the Dask side.

You should probably use chunks=(20, 1000), which gives the shape of each chunk. But I don’t think this will speed things up; with that many chunks it will probably be worse.
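For example (quick sketch), passing the full block shape works, because the data may only be partitioned along the first (samples) axis:

from dask_ml.datasets import make_classification

# chunks=(20, 1000) is the shape of each block: 20 rows per chunk,
# and all 1000 features kept together along the second axis.
X, y = make_classification(n_samples=1000, n_features=1000,
                           random_state=42, chunks=(20, 1000))
print(X.chunks)  # ((20, 20, ..., 20), (1000,))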

The prediction step is easy to distribute over many cores (it’s embarrassingly parallel), so there you should see a speed-up roughly proportional to the number of cores you have (4?).
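For example (a rough sketch, assuming you have dask.distributed installed; the worker count is just an example), you can run the predict step on a local cluster so the chunks are scored on several cores:

from dask.distributed import Client
import sklearn.datasets
from sklearn.svm import SVC

import dask_ml.datasets
from dask_ml.wrappers import ParallelPostFit

# Local "cluster" with 4 single-threaded workers; adjust to your machine.
client = Client(n_workers=4, threads_per_worker=1)

# Fit on a small in-memory dataset, as in your benchmark.
X_small, y_small = sklearn.datasets.make_classification(n_samples=1000)
clf = ParallelPostFit(SVC(gamma='scale')).fit(X_small, y_small)

# Predict on a chunked Dask array: each chunk is scored independently,
# so this part scales with the number of workers.
X, y = dask_ml.datasets.make_classification(n_samples=10_000, chunks=1_000)
y_pred = clf.predict(X).compute()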

So my conclusion is that you should not use dask-ml for LogisticRegression, because it is not an easily distributed algorithm, except when you really have a lot of data. And even in that case, fitting scikit-learn on samples of the data might be better.
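Something like this rough sketch of the sampling idea (the sizes and the sample fraction are arbitrary assumptions on my side):

import numpy as np
import sklearn.linear_model
import dask_ml.datasets

# A large, chunked dataset (sizes are illustrative).
X, y = dask_ml.datasets.make_classification(
    n_samples=500_000, n_features=1000, chunks=(10_000, 1000)
)

# Pull a random 5% sample of the rows into memory...
rng = np.random.RandomState(0)
idx = rng.choice(X.shape[0], size=25_000, replace=False)
X_sample, y_sample = X[idx].compute(), y[idx].compute()

# ...and fit plain scikit-learn on that sample.
clf = sklearn.linear_model.LogisticRegression().fit(X_sample, y_sample)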
