Hi, I am new to dask. I would like to understand how read_csv and read_parquet work.
That say I have a dask cluster with 50 workers. And what will happen if I execute the following code on the cluster?
df = dd.read_csv('gcs://bucket/5TB.csv')
Will every workers fully download the the whole 5TB.csv? Or will each worker only downloads 1/50 of the 5TB.csv?
The client will look at the file size, and split into equal-sized blocks of bytes. Each task works on one of these blocks on one of the workers. It reads that block of bytes and looks for line-endings (\n), and then parses this block of text as CSV, outputting a dataframe. When done, each task will evaluate to a pandas dataframe, part of the overall dataframe for the whole file. Each of these pieces will be in memory on one of the workers.
Thank you for the reply. So, each worker will use http range requests to download part of 5TB.csv from gcs, right?
What about read_parquet? Does it also split into equal-sized blocks of bytes to download?
Read_parquet will use whatever chunking is embedded in the file, one parquet “row-group” per component dataframe/partition.