How to use dask_ml.datasets.make_blobs to save a bigger-than-RAM dataset to disk? (using a single machine)
PS: Can I then fit dask_ml.cluster.KMeans on the saved file using that same machine? (basically out-of-core k-means)
@NightMachinary Hi and welcome to Discourse!
dask_ml.datasets.make_blobs creates Dask arrays, which are evaluated lazily, so you can use it as you usually would, even to create larger-than-memory data:
from dask_ml.datasets import make_blobs
X, y = make_blobs(n_samples=10_000_000, chunks=1_000, n_features=200)
# X.nbytes / 1e9  # 16.0, i.e. the full array would be 16 GB, larger than my RAM
X.to_zarr("my_data_X.zarr") # store to zarr file
y.to_zarr("my_data_y.zarr")
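As a quick sanity check before writing, you can inspect the lazy arrays; the attributes below are standard Dask array attributes and none of them load the data into memory:
print(X)                     # lazy dask.array description, nothing is computed
print(X.nbytes / 1e9, "GB")  # size the full array would have if materialized
print(X.chunksize)           # shape of a single chunk, each chunk fits in RAM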
Some notes:
You can also store the data to hdf5 files; if you do so, you can also check out h5py.File.create_dataset.
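For reference, a minimal sketch of that hdf5 route (the file and dataset names here are just placeholders):
import h5py
import dask.array as da

with h5py.File("my_data.hdf5", mode="w") as f:
    dset = f.create_dataset("/X", shape=X.shape, dtype=X.dtype, chunks=(1_000, 200))
    da.store(X, dset)  # writes the Dask array into the hdf5 dataset block by block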
And, yes, you can fit KMeans on this large data (in fact, this is one of the main advantages of dask-ml!). But I’d suggest fitting it before storing the data locally, otherwise you’ll need to read it back in again; see the sketch just below.
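A sketch of that order of operations, reusing the parameters from above:
from dask_ml.cluster import KMeans
from dask_ml.datasets import make_blobs

X, y = make_blobs(n_samples=10_000_000, chunks=1_000, n_features=200)
kmeans = KMeans(n_clusters=4).fit(X)  # out-of-core fit directly on the lazy array
X.to_zarr("my_data_X.zarr")           # persist afterwards, no second read needed
y.to_zarr("my_data_y.zarr")
If you have already stored the data, you can just as well read it back from the zarr file and fit on that: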
# read data back (if needed)
import dask.array as da
X = da.from_zarr("my_data_X.zarr", chunks=1_000)  # load lazily, chunked as when it was written

# fit the KMeans model
from dask_ml.cluster import KMeans
kmeans = KMeans(n_clusters=4).fit(X)
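After fitting, the usual attributes (as in scikit-learn) are available; cluster_centers_ is a small NumPy array, while labels_ stays a lazy Dask array:
print(kmeans.cluster_centers_)   # the 4 cluster centres
labels = kmeans.labels_          # lazy cluster assignments, same length as X
print(labels[:10].compute())     # materialize only a small slice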
Does this help answer your question?