Is it necessary to call compute() before calling to_parquet()?

I am trying to load two sets of csv files, merge them, and save the result to a parquet file. The full data set is larger than memory, so I am purposefully not calling compute() (which, if my understanding is correct, would bring everything into memory). However, no data seems to be written.

Is there any way to do this without running out of memory?

The data will later be filtered, at which stage it will fit in memory, but I am trying to avoid having to rerun the concatenation and merging steps (along with some other processing) each time.

Thanks

import dask.dataframe as dd

if __name__ == '__main__':
    # Lazily read both sets of CSV files
    stops = dd.read_csv('stop*.csv', delimiter=";", header=0)
    types = dd.read_csv('type*.csv', delimiter=";", header=0)

    # Lazy left join on the shared 'date' column
    merged = dd.merge(stops, types, on='date', how='left')

    # merged = merged.compute()    # not needed; this would pull everything into memory

    # Writes a folder of parquet files, one per partition,
    # without materialising the full result in memory
    merged.to_parquet('myParquetFile.parquet')

So calling to_parquet() without calling compute() writes a number of parquet files within a folder, whereas calling compute() first writes a single file (presumably because in that case it is saving a pandas DataFrame rather than a Dask one).
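For the later filtering step, that folder can be read back lazily with dd.read_parquet() and reduced before calling compute(), so the CSV parsing and merge do not have to be rerun. Below is a minimal sketch, assuming the folder written above; the filter column and cutoff value are placeholders, not from the original post:

import dask.dataframe as dd

# Lazily read the partitioned parquet folder; each file becomes a partition
merged = dd.read_parquet('myParquetFile.parquet')

# Filter lazily (placeholder column/value), then compute() once the
# subset is small enough to fit in memory
subset = merged[merged['date'] >= '2020-01-01'].compute()

Since read_parquet() is also lazy, the filter and compute() run against the saved parquet data rather than repeating the earlier processing.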

Apologies - I didn't notice the folders being created; I was only looking at the files.
