Hi,
I ran into this issue while analyzing multiple dataframes, each with around 67 million rows. I can compute() or export to_csv() any individual dataframe on its own. I use a for loop to create 50 dataframes and append them to a list (I know a for loop is not the ideal way to use dask; I am still figuring that out). I then concatenate the list of 50 dataframes into one dataframe with 50 columns and take the row-wise mean. However, compute() on the final dataframe-mean fails, even though the result has the same size as each individual dataframe, which computes fine on its own. The script gets killed by the OS after it takes up about 95% of the available memory.
I am not sure what is happening here, or whether there is a workaround for it? (I sketched one alternative I was considering at the end of this post, but I am not sure it is the right direction.)
Thank you!
import dask.dataframe as dd
import dask.array as da

Results = []
for i in range(50):
    # Create a 1D dask array with 67108864 elements in a single chunk
    fft_input = da.random.randint(low=-20, high=100, size=(67108864,)).rechunk(chunks=(67108864,))
    # Convert the array to a dataframe (data.compute() works fine) and store it in the list
    data = dd.from_dask_array(fft_input)
    Results.append(data)

# The list contains 50 separate dataframes; concat them into 1 dataframe with 50 columns
allResult = dd.concat(Results, axis=1)
# Take the row-wise mean of the concatenated dataframe
dfff = allResult.mean(axis=1)
print(dfff)
dfff.compute()  # This is when the script gets killed
Here is the output of the above code:
Dask Series Structure:
npartitions=1
0 float64
67108863 ...
dtype: float64
Dask Name: dataframe-mean, 402 tasks
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
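My rough back-of-the-envelope estimate (assuming the values stay int64) is that the single concatenated 50-column partition alone is about 25 GiB, so I suspect the intermediate copies made during the concatenation and the mean push the peak past my 39.2 GB. Does that sound right? Here is the arithmetic I used:

rows = 67_108_864
cols = 50
bytes_per_value = 8  # assuming int64, the default dtype of randint on Linux
partition_gib = rows * cols * bytes_per_value / 2**30
print(f"single 50-column partition is about {partition_gib:.0f} GiB")  # about 25 GiB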
Environment:
Dask version: 2022.2.1
Python version: 3.8
Operating System: Linux Ubuntu 18.04
Memory: 39.2 GB
OS type: 64 bit
Install method (conda, pip, source): pip
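For reference, here is the alternative I was considering: keep everything as one 2-D dask array with smaller chunks instead of concatenating 50 single-partition dataframes, so no task has to hold all 50 columns for the full 67 million rows at once. This is only a sketch (the chunk size of 1_048_576 is an arbitrary choice on my part), and I am not sure whether it is the recommended approach:

import dask.array as da

n = 67_108_864
chunk = 1_048_576  # arbitrary smaller chunk size so each task only touches a slice of the rows

# Build the 50 columns as chunked dask arrays instead of single-chunk ones
columns = [
    da.random.randint(low=-20, high=100, size=(n,), chunks=(chunk,))
    for _ in range(50)
]

stacked = da.stack(columns, axis=1)  # lazy 2-D array of shape (n, 50)
row_mean = stacked.mean(axis=1)      # row-wise mean, same length as one column
result = row_mean.compute()          # hoping this stays within memory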