How does read_csv or read_parquet distribute read operations?

rueian · June 12, 2022, 10:11am

Hi, I am new to dask. I would like to understand how read_csv and read_parquet work.

That say I have a dask cluster with 50 workers. And what will happen if I execute the following code on the cluster?

df = dd.read_csv('gcs://bucket/5TB.csv')
df.persist()

Will every workers fully download the the whole 5TB.csv? Or will each worker only downloads 1/50 of the 5TB.csv?

martindurant · June 14, 2022, 12:57pm

The client will look at the file size, and split into equal-sized blocks of bytes. Each task works on one of these blocks on one of the workers. It reads that block of bytes and looks for line-endings (\n), and then parses this block of text as CSV, outputting a dataframe. When done, each task will evaluate to a pandas dataframe, part of the overall dataframe for the whole file. Each of these pieces will be in memory on one of the workers.

rueian · June 14, 2022, 1:02pm

Hi martindurant,

Thank you for the reply. So, each worker will use http range requests to download part of 5TB.csv from gcs, right?

What about read_parquet? Does it also split into equal-sized blocks of bytes to download?

martindurant · June 14, 2022, 1:09pm

Read_parquet will use whatever chunking is embedded in the file, one parquet “row-group” per component dataframe/partition.

Topic		Replies	Views
Dask read_csv() multiple files but separate partition for each file Dask DataFrame	4	939	January 24, 2024
Dask not distributing reading of parquet file? Distributed parquet , distributed	1	1713	April 6, 2023
Option to batch read_parquet Dask DataFrame parquet	1	177	April 11, 2023
Slow processing of parquet dataset using the distributed client Dask DataFrame distributed	1	372	October 11, 2022
Using dask's read_csv or pandas's read_csv in from_map? Dask DataFrame distributed	3	81	July 31, 2024

How does read_csv or read_parquet distribute read operations?

Related topics