I have the following problem: I want to join multiple DataFrames, but I always run out of memory. My goal is to achieve an operation like `polars.concat(how='diagonal')`, or equivalently a chain of outer joins.
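For context, the behaviour I'm after can be sketched with plain pandas (toy frames with made-up columns): the columns are unioned and the gaps are filled with NaN, which is what `polars.concat(how='diagonal')` does.

```python
import pandas as pd

# Two frames with partially overlapping columns, like my per-span CSVs
a = pd.DataFrame({'id': [1, 2], 'x': [10, 20]})
b = pd.DataFrame({'id': [3], 'y': [30]})

# Diagonal-style concat: union of columns, missing cells become NaN
out = pd.concat([a, b], ignore_index=True)
print(out)
```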
I have simplified my code to this:
```python
from pathlib import Path
import dask.dataframe as dd

trace_path = Path('x')
dfs = []
for span_path in trace_path.iterdir():
    df = dd.read_csv(span_path)
    # Dask operations return new objects, so the results must be reassigned
    df = df.repartition(npartitions=1)
    df = df.set_index('id', sorted=True)
    dfs.append(df)

# Fold all frames into one via repeated outer merges on the shared columns
master_df = dfs[0]
for df in dfs[1:]:
    intersecting_columns = list(set(master_df.columns).intersection(set(df.columns)))
    master_df = master_df.merge(df, how='outer', on=intersecting_columns)

master_df = master_df.repartition(npartitions=10)
master_df.dask.visualize(filename='high-test.svg', optimize_graph=True)
master_df.visualize(filename='low-test.svg', optimize_graph=True)
master_df.to_csv("dask-out-*.csv")
```
My high-level graph looks like this:
Is there a better or more resource-friendly way of doing this? Am I missing something?
I am really confused and kind of desperate, but I am happy to join the Dask community.

Btw, I have 16 GB of RAM.
It should be noted that my test data currently consists of 10 files of around 10 kB each. I only have one index column, and its value is always the same, so every row matches every other row and the result grows exponentially with the number of files. I basically need every combination. Is this too demanding/dumb?
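To make the blow-up concrete: since the index value is always the same, each outer merge on it is effectively a cross join, so row counts multiply at every step. A tiny pandas sketch of that effect (toy data):

```python
import pandas as pd

# Every row shares the same key, as in my data
a = pd.DataFrame({'id': [1] * 3, 'x': range(3)})
b = pd.DataFrame({'id': [1] * 4, 'y': range(4)})

# An outer merge on a constant key matches every row with every row
m = a.merge(b, how='outer', on='id')
print(len(m))  # 3 * 4 = 12 rows
```

With 10 files this product is taken 9 times, which is where the memory goes.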