How to use `dask_ml.datasets.make_blobs` to save a bigger than RAM dataset to disk?

How can I use dask_ml.datasets.make_blobs to save a bigger-than-RAM dataset to disk (using a single machine)?

PS: Can I then fit dask_ml.cluster.KMeans on the saved file using that same machine? (basically out-of-core k-means)

@NightMachinary Hi and welcome to Discourse!

dask_ml.datasets.make_blobs creates Dask Arrays, which are evaluated lazily, so you can use it just as you normally would, even to create larger-than-memory data:

from dask_ml.datasets import make_blobs

X, y = make_blobs(n_samples=10_000_000, chunks=1_000, n_features=200)

# X.nbytes / 1e9 == 16.0, i.e. the lazy array describes 16 GB of data, larger than my RAM

X.to_zarr("my_data_X.zarr")  # written to a zarr store chunk by chunk, never loading the full array
y.to_zarr("my_data_y.zarr")
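
Just to spell out what's happening there: to_zarr needs the zarr package installed, and nothing is actually generated until it runs, so you can inspect the lazy array for free first. A small, purely illustrative check:

# inspecting the lazy array costs nothing, no blob data is generated yet
print(X.shape)         # (10000000, 200)
print(X.chunksize)     # the per-chunk shape implied by chunks=1_000 above
print(X.nbytes / 1e9)  # 16.0, the size in GB once materialised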


And, yes, you can fit KMeans on this large data (in fact, this is one of the main advantages of Dask-ML!). That said, I’d suggest fitting it before storing the data locally, because otherwise you’ll need to read it all back in again.

# read the data back (if needed); from_zarr is also lazy
import dask.array as da
X = da.from_zarr("my_data_X.zarr", chunks=1_000)

# fit a KMeans model on the Dask Array
from dask_ml.cluster import KMeans
kmeans = KMeans(n_clusters=4).fit(X)
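
If all you want is the fitted model, here’s a minimal sketch of the “fit first, store after” order suggested above (same parameters as before, purely illustrative):

from dask_ml.datasets import make_blobs
from dask_ml.cluster import KMeans

# generate lazily and fit directly, without a round trip through disk
X, y = make_blobs(n_samples=10_000_000, chunks=1_000, n_features=200)
kmeans = KMeans(n_clusters=4).fit(X)

# store afterwards only if you still need the raw data on disk
# (the lazy blobs are recomputed here for the write)
X.to_zarr("my_data_X.zarr")
y.to_zarr("my_data_y.zarr")

Either way, the fitted estimator exposes the familiar cluster_centers_ and labels_ attributes (labels_ comes back as a Dask Array).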

Does this help answer your question?
