How does read_csv or read_parquet distribute read operations?

Hi, I am new to dask. I would like to understand how read_csv and read_parquet work.

Say I have a Dask cluster with 50 workers. What will happen if I execute the following code on the cluster?

import dask.dataframe as dd

df = dd.read_csv('gcs://bucket/5TB.csv')
df = df.persist()  # persist() returns a new collection, so reassign it

Will every worker fully download the whole 5TB.csv, or will each worker only download 1/50 of it?

The client will look at the file size and split it into equal-sized blocks of bytes. Each task works on one of these blocks on one of the workers: it reads that block of bytes, looks for line endings (\n), and then parses the block of text as CSV, producing a dataframe. When done, each task will have produced a pandas dataframe, one part of the overall dataframe for the whole file. Each of these pieces will be in memory on one of the workers.
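For example (the bucket path here is hypothetical), you can control the block size with read_csv's blocksize argument and then check how many partitions/tasks that produces:

import dask.dataframe as dd

# Each ~256 MB block of bytes becomes one task that parses its chunk into a pandas dataframe
df = dd.read_csv('gcs://bucket/5TB.csv', blocksize='256MB')

# Roughly 5 TB / 256 MB gives on the order of 20,000 partitions, spread across the workers
print(df.npartitions)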

Hi martindurant,

Thank you for the reply. So each worker will use HTTP range requests to download its part of 5TB.csv from GCS, right?

What about read_parquet? Does it also split into equal-sized blocks of bytes to download?

read_parquet will use whatever chunking is embedded in the file itself: one Parquet "row group" per component dataframe/partition.
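As a minimal sketch (the file name is hypothetical, and this assumes pyarrow is installed), you can compare the row-group count in the file's footer metadata with the number of partitions Dask creates. Only the footer needs to be read to plan the work; each task then fetches just the bytes belonging to its own row group.

import dask.dataframe as dd
import pyarrow.parquet as pq

# Inspect the Parquet footer to see how many row groups the file contains
meta = pq.ParquetFile('data.parquet').metadata
print(meta.num_row_groups)

# One partition per row group, as described above (this can vary with Dask
# version and the split_row_groups option)
df = dd.read_parquet('data.parquet')
print(df.npartitions)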
