Is it necessary to call compute() before calling to_parquet()?

I am trying to load two sets of csv files, merge them, and save the result to a parquet file. The full data set is larger than memory, so I am purposefully not calling compute() (which, if my understanding is correct, would bring everything into memory). However, no data seems to be written.

Is there any way to do this without running out of memory?

The data will later be filtered, at which stage it will fit in memory, but I am trying to avoid having to rerun the concatenation and merging steps (along with some other processing) each time.

Thanks

import dask.dataframe as dd

if __name__ == '__main__':
    # Lazily read both sets of CSV files
    stops = dd.read_csv('stop*.csv', delimiter=";", header=0)
    types = dd.read_csv('type*.csv', delimiter=";", header=0)

    # Lazy left join on the shared 'date' column
    merged = dd.merge(stops, types, on='date', how='left')

    # merged = merged.compute()    # not needed; this would pull everything into memory

    # Writes a folder of parquet files, one per partition,
    # without materialising the full result in memory
    merged.to_parquet('myParquetFile.parquet')

So calling to_parquet() without calling compute() writes a number of parquet files within a folder, whereas calling compute() first writes a single file (presumably because in that case it is saving a pandas DataFrame rather than a Dask one).
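For the later filtering step, that folder can be read back lazily with dd.read_parquet() and reduced before calling compute(), so the CSV parsing and merge do not have to be rerun. Below is a minimal sketch, assuming the folder written above; the filter column and cutoff value are placeholders, not from the original post:

import dask.dataframe as dd

# Lazily read the partitioned parquet folder; each file becomes a partition
merged = dd.read_parquet('myParquetFile.parquet')

# Filter lazily (placeholder column/value), then compute() once the
# subset is small enough to fit in memory
subset = merged[merged['date'] >= '2020-01-01'].compute()

Since read_parquet() is also lazy, the filter and compute() run against the saved parquet data rather than repeating the earlier processing.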

Apologies - I didn't notice the folders being created; I was only looking at the files.
