Parallel I/O using Dask

I have some Python code that reads in several TIFF images, does some data processing, and then writes an output TIFF to disk. The output file is very large, and writing it to disk takes a long time because the IO is done serially.

Does dask support parallel IO operations? Using dask, can I have multiple workers write to different sections of a single file simultaneously?

Also, using dask, is it possible to distribute IO operations across the nodes in a cluster?

Hi @Wombat, welcome to the Dask community!

I’m not aware of any Python library that supports parallel IO on TIFF files, but I’d like to be proven wrong, as that could be a game changer!

Since Dask uses Python processes, I don’t see how that can be done.

It is, but again not to the same file. The workaround here is to write several TIFF files, and perhaps create a VRT file to get a virtual dataset stacking all the files one way or another (see the sketch below).
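
For illustration, here’s a minimal sketch of that workaround: one TIFF per chunk, written in parallel, then stacked with GDAL’s `BuildVRT`. All file names, shapes and chunk sizes are made up, and the VRT step assumes the tiles carry georeferencing so GDAL can place them:

```python
import dask
import numpy as np
import dask.array as da
import tifffile
from osgeo import gdal  # only needed for the VRT step

# Stand-in for the (large) result of your processing pipeline
result = da.random.random((8192, 8192), chunks=(2048, 2048))

# One delayed write per chunk; Dask executes them in parallel
blocks = result.to_delayed()  # 4x4 object array of delayed numpy blocks
tasks = [
    dask.delayed(tifffile.imwrite)(f"out_{i}_{j}.tif", block)
    for (i, j), block in np.ndenumerate(blocks)
]
dask.compute(*tasks)

# Stitch the pieces into a single virtual dataset
# (note: BuildVRT needs georeferenced tiles to mosaic them correctly)
gdal.BuildVRT("mosaic.vrt", [f"out_{i}_{j}.tif" for i in range(4) for j in range(4)])
```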

cc @martindurant

Yes, parallel access to TIFF is possible and normal. In particular, you should read up on “cloud optimized GeoTIFF” (COG), which has chunking designed with this in mind. As the name suggests, COG is mainly used in the earth sciences.
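
As a concrete illustration of why COGs parallelise well: the internal tiling lets each task read just its own window, independently of the others. A sketch, assuming a tiled file named `cog.tif`:

```python
import dask
import rasterio
from rasterio.windows import Window

def read_window(path, window):
    # Each task opens the file itself, so tasks are fully independent
    with rasterio.open(path) as src:
        return src.read(1, window=window)

# Read four 512x512 tiles of a tiled (cloud optimized) GeoTIFF in parallel
windows = [Window(col, row, 512, 512) for row in (0, 512) for col in (0, 512)]
tiles = dask.compute(*[dask.delayed(read_window)("cog.tif", w) for w in windows])
```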

You may want to check out how xarray handles TIFF files (it uses rasterio; dask example here); that’s possibly the best analysis package for this kind of thing. It can transparently use dask and concat/stack your images if they have appropriate coordinate information. I don’t know for sure, but I expect it uses a lock when writing TIFFs across a cluster.
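
For reference, opening a TIFF lazily as a dask-backed array with rioxarray looks roughly like this (the path and chunk sizes are placeholders):

```python
import rioxarray

# Each 1024x1024 chunk becomes one dask task, read on demand
xda = rioxarray.open_rasterio("input.tif", chunks={"x": 1024, "y": 1024})
mean = xda.mean().compute()  # computation runs in parallel across chunks
```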

See also tifffile, which can extract the buffer locations within a TIFF, and which can be used by kerchunk for making virtual zarr datasets out of TIFFs.
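
For example, the pattern from the tifffile docs that exposes a TIFF as a zarr store, which dask can then read chunk by chunk (file name is a placeholder; this assumes a zarr 2.x install, which is what tifffile’s store implements):

```python
import dask.array as da
import tifffile
import zarr

# Expose the TIFF's internal tiles/strips as a zarr store (no pixel data read yet)
store = tifffile.imread("input.tif", aszarr=True)
arr = da.from_zarr(zarr.open(store, mode="r"))
print(arr.chunks)  # chunking follows the TIFF's tile/strip layout
```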

@martindurant, I’m a bit curious about what you say. While I totally agree that parallel reading of TIFF files is doable in Python, I’m a bit more sceptical about parallel writes. I mean it in the sense of MPI-IO-like tools, which allow real concurrent writes on a single file in a distributed way.

To the question:

> Using dask, can I have multiple workers write to different sections of a single file simultaneously?

I believe the answer is no, isn’t that right?

It’s nice to have a mechanism that already handles this through a lock, but performance must be affected.

The rioxarray example I linked suggests that you can have multiple workers “involved” in writing to a single file, but there is a lock, and I don’t know how it works in practice; nor do I know whether the xarray API uses this code path.
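
For what it’s worth, the pattern from the rioxarray docs looks roughly like this: each worker computes its chunks independently, and a named distributed lock serialises the actual writes to the single output file (a sketch, untested here; file names are placeholders):

```python
import rioxarray
from dask.distributed import Client, Lock, LocalCluster

with LocalCluster() as cluster, Client(cluster) as client:
    # lock=False is fine for reading: each task opens the file on its own
    xds = rioxarray.open_rasterio("input.tif", chunks=True, lock=False)
    result = xds * 2  # stand-in for real processing
    # A named distributed lock serialises the chunk writes into one file
    result.rio.to_raster("output.tif", tiled=True, lock=Lock("rio", client=client))
```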

So you could indeed write to separate files (as you suggested) and later use xarray (or kerchunk) to form a logical dataset over the multiple output files.
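
For instance, assuming the per-chunk tiles from the earlier sketch carry consistent coordinates, xarray can reassemble them lazily (names and grid layout are made up):

```python
import xarray as xr
import rioxarray

# Open each tile lazily and combine the 4x4 grid into one logical array
grid = [
    [rioxarray.open_rasterio(f"out_{i}_{j}.tif", chunks=True) for j in range(4)]
    for i in range(4)
]
mosaic = xr.combine_nested(grid, concat_dim=["y", "x"])
```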
