Hey!
I have the following problem: I want to join multiple DataFrames, but I always run out of memory! My goal is to achieve an operation like polars.concat(how='diagonal') or repeated outer joins.
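For reference, this is roughly the behaviour I am after (a minimal sketch with made-up columns and values):

import polars as pl

a = pl.DataFrame({"id": [1, 2], "foo": ["x", "y"]})
b = pl.DataFrame({"id": [3], "bar": [7]})

# Rows are stacked, the union of all columns is kept,
# and missing values are filled with null.
print(pl.concat([a, b], how="diagonal"))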
I have simplified my code to this:
from pathlib import Path

import dask.dataframe as dd

trace_path = Path('x')

# Read every span file and prepare it for the joins.
dfs = []
for span_path in trace_path.iterdir():
    df = dd.read_csv(span_path)
    df = df.repartition(npartitions=1)
    df = df.set_index('id', sorted=True)
    dfs.append(df)

# Repeatedly outer-join on the columns the frames have in common.
master_df = dfs[0]
for df in dfs[1:]:
    intersecting_columns = list(set(master_df.columns).intersection(set(df.columns)))
    master_df = master_df.merge(df, how='outer', on=intersecting_columns)

master_df = master_df.repartition(npartitions=10)
master_df.dask.visualize(filename='high-test.svg', optimize_graph=True)
master_df.visualize(filename='low-test.svg', optimize_graph=True)
master_df.to_csv("dask-out-*.csv")
My high-level graph looks like this:
Is there a better or more resource-friendly way of doing this? Am I missing something?
I am really confused and kind of desperate, but I am happy to join the Dask Community.
Btw., I have 16 GB of RAM.
Edit:
It should be noted that my test data currently consists of 10 files of around 10 kB each. There is only one index column and its value is always the same, so every row matches every row and the result grows exponentially with each join. I basically need every combination. Is this too demanding/dumb?
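To illustrate the blow-up with plain pandas (made-up numbers, not my real data):

import pandas as pd

a = pd.DataFrame({"id": [1] * 100, "foo": range(100)})
b = pd.DataFrame({"id": [1] * 100, "bar": range(100)})

# With a single, constant key the outer join degenerates into a
# cross join: 100 * 100 = 10,000 rows after just one merge.
merged = a.merge(b, how="outer", on="id")
print(len(merged))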