How to use Dask to overcome an out-of-memory problem?

Dear all,

My name is Hieu and I am a Ph.D. student. I am currently handling huge time series data and want to use the Dask library to reduce the memory footprint of my data. My data consists of 15 multivariate time series, each with eight variables and about 1.5 million time points. I use sliding windows, and my approach to analyzing the time series is topological data analysis.

I generate geometric features from the time series, and these new geometric features consume a lot of memory. I tested my experiment on some high-performance computers with very good configurations (1024 GB of RAM), but I still ran into out-of-memory errors.

I also tried to apply Dask by converting the input from a normal Pandas DataFrame to a Dask DataFrame (before applying the geometric transformation). The first time the code ran well, but the second time I faced the same out-of-memory problem. I attach my code here.

import dask.dataframe as dd

# norm_x and norm_y are Pandas DataFrames that are already loaded in memory
ddf_x = dd.from_pandas(norm_x, npartitions=500)
ddf_y = dd.from_pandas(norm_y, npartitions=500)

Could you please suggest a better way to use Dask to overcome my current problem?

Please support me. If the crash happens again, my account for accessing the high-performance computers will be disabled, and I really need this account to run my experiments.

Thank you very much.
Regards,
Hieu

I attach here an image that shows the memory usage of my code on the high-performance computers; the white part is where the Out-of-Memory Killer triggers.

[Image: jupiter3_outmem24dec22]

Hi @hieutv85, welcome to the forum!

First, the code snippet you provided is a little too short to understand what you are trying to do with Pandas and what your current Dask code looks like. Could you detail this a bit?

If you do ddf_x = dd.from_pandas(norm_x, npartitions=500), it means you first read the DataFrame with Pandas before converting it to Dask, so you load all your input data into memory; this is not what you want.

Instead, you should use Dask directly to read your input data, then apply transformations and reductions to it before either saving the result back to disk or converting it to Pandas. But again, we need some more code to be able to help you better.
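For example, a minimal sketch of this pattern (the file path, format, and column names here are only assumptions, since I don't know your actual data layout):

import dask.dataframe as dd

# Read the input lazily with Dask instead of loading it all with Pandas,
# and only the columns you actually need
ddf = dd.read_parquet("timeseries/*.parquet", columns=["t", "x1", "x2"])

# Apply transformations lazily, partition by partition
ddf = ddf[ddf["x1"].notnull()]
result = ddf.groupby("t").mean()  # some reduction that shrinks the data

# Only now trigger the computation, either writing back to disk
# or converting the (now much smaller) result to Pandas
result.to_parquet("features/")  # or: small_df = result.compute()

This way the full dataset never has to fit in memory at once.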

See [link] and also [link] for some good tips to get started.

Hi @guillaumeeb

Thank you very much for your support. I have already read the two linked documents; they are helpful.

My code is long and includes several parts: reading files into a DataFrame, preprocessing, removing unused data, and several steps that convert the data into a new form. So I am not sure which part I should provide.

Regards,

What we need to help you is a minimal version of your code that reproduces the issue.

You need to at least identify where it fails; we can then help you with the why and the fix.

If you cannot do that, please at least provide some code showing how you read the data, a representative processing step, and the final step where you launch the computation or try to write a result.
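Something like this skeleton is usually enough (the file names and steps below are made up, just to show the level of detail that helps):

import dask.dataframe as dd

# 1. How the data is read
ddf = dd.read_csv("series_*.csv")

# 2. One representative processing step
ddf = ddf[ddf["value"] > 0]

# 3. The step where the crash happens, e.g. the final computation
ddf.to_parquet("output/")  # or ddf.compute(), or the call into your feature extraction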

Hi @guillaumeeb,

Could you please share your email address with me?

My code is related to a publication that I have not published yet, so it would be much better if I could provide it privately. Please understand.

Thank you very much. I appreciate your support and your fast replies.

Couldn’t you try to extract some key parts of the code? I won’t be able to go through all of your code to analyze it; I would need a minimal reproducer to be able to help.

You can send me a private message through this forum, but if the code is too complicated, I won’t be able to help.

Thanks @guillaumeeb. I have sent you a message with the key part of the code.

So after a private discussion with @hieutv85, it appears that the issue comes from his use of the giotto-tda library. This library is not Dask compatible, in the sense that it cannot take advantage of a Dask DataFrame or Dask Array, and it does not implement any partial_fit method.

Instead, it relies on the check_array method from scikit-learn. This method, if given a Dask Array, will just convert it to a NumPy one, loading all the data into a single server's memory. So it is pointless to try to use Dask here.
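To illustrate (a small sketch with a toy array; this is not giotto-tda's exact call path, but it shows the coercion that check_array performs):

import dask.array as da
from sklearn.utils import check_array

# A lazy Dask array: nothing is materialized in memory yet
x = da.ones((1_000_000, 8), chunks=(100_000, 8))

# check_array coerces its input to a NumPy ndarray, which forces
# every chunk to be computed and held in RAM at the same time
arr = check_array(x)
print(type(arr))  # <class 'numpy.ndarray'>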

If giotto-tda is a must-have, then the optimization should be done elsewhere: reduce the size of the input dataset, take a sample, or work on just a part of it. Also verify that nothing useless is kept in memory, clean up the code a bit and free Python objects that are no longer needed, and avoid multiplying objects. Finally, try to read only the columns that are really needed into the Pandas DataFrame.
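A few of these ideas in code (only a sketch; the file name, column names, and dtypes are placeholders):

import gc
import pandas as pd

# Read only the columns that are actually used, with smaller dtypes
df = pd.read_csv(
    "series_01.csv",
    usecols=["t", "x1", "x2"],
    dtype={"x1": "float32", "x2": "float32"},
)

# Validate the pipeline on a slice or a sample before the full run
sample = df.iloc[:100_000]

# Free intermediates as soon as they are no longer needed
del df
gc.collect()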

Thank you very much for your support, @guillaumeeb.
