[New User] How to use dask to handle a large file from concatenation of many small files

Hey,
I am a new dask user, drawn here by a specific problem I am facing now. Here is my problem:
I have a bunch of .nc files, each small enough to be read into memory on its own. To plot/process the whole dataset (all of the .nc files), the traditional method is to concatenate them first and then work on that single dataset containing the contents of all the files. However, the total size of my .nc files is now too big to allow this concatenation in memory. I want to use multiple cores on an HPC cluster to process this dataset. How can Dask help with this?
I have seen xarray’s open_mfdataset() method, but it’s not immediately clear how that helps either, since the .nc files contain several data variables whose dimensions are not all the same: say, variables 1, 3, and 6 can concatenate along one common dimension across all the .nc files, while variables 2, 4, and 5 concatenate along another. (The traditional way is, of course, to pick each group out and concatenate it individually.)
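To make the situation concrete, here is a minimal sketch of the pattern I have in mind, assuming xarray and dask are installed. The variable names (`var1`, `var2`) and dimension names (`time`, `trajectory`) are placeholders; the `fake_file` helper just stands in for what `xr.open_dataset(path, chunks={})` would return for each real .nc file:

```python
import numpy as np
import xarray as xr

# Stand-in for one .nc file; in practice you would call
# xr.open_dataset(path, chunks={}) so each variable becomes a lazy dask array.
def fake_file(i):
    return xr.Dataset({
        "var1": ("time", np.arange(4) + 10 * i),        # concatenates along "time"
        "var2": ("trajectory", np.arange(3) + 10 * i),  # concatenates along "trajectory"
    })

# .chunk({}) wraps every variable in a dask array without loading data eagerly.
datasets = [fake_file(i).chunk({}) for i in range(3)]

# Split the variables into the groups that share a concat dimension,
# concatenate each group along its own dimension, then merge the results.
combined = xr.merge([
    xr.concat([ds[["var1"]] for ds in datasets], dim="time"),
    xr.concat([ds[["var2"]] for ds in datasets], dim="trajectory"),
])

print(dict(combined.sizes))  # {'time': 12, 'trajectory': 9}
# 'combined' is still lazy; computations such as
# combined["var1"].mean().compute() only run when triggered.
```

My understanding is that the resulting dataset stays lazy, so a scheduler (e.g. dask.distributed or dask-jobqueue on an HPC cluster) could then parallelize the actual computation, but I'd appreciate confirmation that this is the right approach.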

Sorry if the question is not that clear. Please ask for any clarifications if needed.

Thanks!


Have you read this blogpost?
