How to use dask_ml.datasets.make_blobs to save a bigger-than-RAM dataset to disk? (using a single machine)
PS: Can I then fit dask_ml.cluster.KMeans on the saved file using that same machine? (basically out-of-core k-means)
@NightMachinary Hi and welcome to Discourse!
dask_ml.datasets.make_blobs creates Dask arrays, which are evaluated lazily, so you can use it as you usually would, even to create larger-than-memory data:
from dask_ml.datasets import make_blobs
X, y = make_blobs(n_samples=10_000_000, chunks=1_000, n_features=200)
# X.nbytes / 1e9  # 16.0, i.e. the full array would be 16 GB, larger than my RAM
X.to_zarr("my_data_X.zarr") # store to zarr file
y.to_zarr("my_data_y.zarr")
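As a quick sanity check before writing, you can inspect the lazy arrays; the attributes below are standard Dask array attributes and none of them load the data into memory:
print(X)                     # lazy dask.array description, nothing is computed
print(X.nbytes / 1e9, "GB")  # size the full array would have if materialized
print(X.chunksize)           # shape of a single chunk, each chunk fits in RAM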
Some notes:
You can also store the data to hdf5 files; if you do so, you can also check out h5py.File.create_dataset.
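For reference, a minimal sketch of that hdf5 route (the file and dataset names here are just placeholders):
import h5py
import dask.array as da

with h5py.File("my_data.hdf5", mode="w") as f:
    dset = f.create_dataset("/X", shape=X.shape, dtype=X.dtype, chunks=(1_000, 200))
    da.store(X, dset)  # writes the Dask array into the hdf5 dataset block by block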
And, yes, you can fit KMeans on this large data (in fact, this is one of the main advantages of dask-ml!). But I’d suggest fitting it before storing the data locally, otherwise you’ll need to read it back in again; see the sketch just below.
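A sketch of that order of operations, reusing the parameters from above:
from dask_ml.cluster import KMeans
from dask_ml.datasets import make_blobs

X, y = make_blobs(n_samples=10_000_000, chunks=1_000, n_features=200)
kmeans = KMeans(n_clusters=4).fit(X)  # out-of-core fit directly on the lazy array
X.to_zarr("my_data_X.zarr")           # persist afterwards, no second read needed
y.to_zarr("my_data_y.zarr")
If you have already stored the data, you can just as well read it back from the zarr file and fit on that: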
# read data back (if needed)
import dask.array as da
X = da.from_zarr("my_data_X.zarr", chunks=1_000)  # load lazily, chunked as when it was written

# fit the KMeans model
from dask_ml.cluster import KMeans
kmeans = KMeans(n_clusters=4).fit(X)
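After fitting, the usual attributes (as in scikit-learn) are available; cluster_centers_ is a small NumPy array, while labels_ stays a lazy Dask array:
print(kmeans.cluster_centers_)   # the 4 cluster centres
labels = kmeans.labels_          # lazy cluster assignments, same length as X
print(labels[:10].compute())     # materialize only a small slice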
Does this help answer your question?