[New User] How to use dask to handle a large file from concatenation of many small files

Hey,
I am a new dask user, drawn here by a specific problem I am facing now. Here is my problem:
I have a bunch of .nc files, each small enough to be read into memory on its own. To plot/process the whole dataset (all of the .nc files), the traditional method is to concatenate them first and then work on that single dataset containing the contents of all the files. However, the total size of my .nc files is now too big to allow this concatenation in memory. I want to use multiple cores on an HPC cluster to process this dataset. How can Dask help with this?
I have seen xarray’s open_mfdataset() method, but it’s not immediately clear how that helps either, since the .nc files contain several data variables whose dimensions are not all the same: say, variables 1, 3, and 6 can concatenate along one common dimension across all the .nc files, while variables 2, 4, and 5 concatenate along another. (The traditional way is, of course, to pick each group out and concatenate it individually.)
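To make the situation concrete, here is a minimal sketch of the pattern I have in mind, assuming xarray and dask are installed. The variable names (`var1`, `var2`) and dimension names (`time`, `trajectory`) are placeholders; the `fake_file` helper just stands in for what `xr.open_dataset(path, chunks={})` would return for each real .nc file:

```python
import numpy as np
import xarray as xr

# Stand-in for one .nc file; in practice you would call
# xr.open_dataset(path, chunks={}) so each variable becomes a lazy dask array.
def fake_file(i):
    return xr.Dataset({
        "var1": ("time", np.arange(4) + 10 * i),        # concatenates along "time"
        "var2": ("trajectory", np.arange(3) + 10 * i),  # concatenates along "trajectory"
    })

# .chunk({}) wraps every variable in a dask array without loading data eagerly.
datasets = [fake_file(i).chunk({}) for i in range(3)]

# Split the variables into the groups that share a concat dimension,
# concatenate each group along its own dimension, then merge the results.
combined = xr.merge([
    xr.concat([ds[["var1"]] for ds in datasets], dim="time"),
    xr.concat([ds[["var2"]] for ds in datasets], dim="trajectory"),
])

print(dict(combined.sizes))  # {'time': 12, 'trajectory': 9}
# 'combined' is still lazy; computations such as
# combined["var1"].mean().compute() only run when triggered.
```

My understanding is that the resulting dataset stays lazy, so a scheduler (e.g. dask.distributed or dask-jobqueue on an HPC cluster) could then parallelize the actual computation, but I'd appreciate confirmation that this is the right approach.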

Sorry if the question is not that clear. Please ask for any clarifications if needed.

Thanks!


Have you read this blogpost?
