How to use Dask to overcome an out-of-memory problem?

Dear all,

My name is Hieu and I am a Ph.D. student. I am currently handling huge time series data and want to use the Dask library to reduce the memory footprint of my data. My data consists of 15 multivariate time series, each with eight variables and about 1.5 million time points. I use sliding windows, and my approach to analyzing the time series is topological data analysis.

I generate geometric features from the time series, and these new geometric features consume a lot of memory. I tested my experiment on some high-performance computers with very good configurations (1024 GB of RAM), but I still ran into out-of-memory errors.

I also tried to apply Dask by converting the input from a normal Pandas DataFrame to a Dask DataFrame (before applying the geometric transformation). The first time the code ran well, but the second time I faced the same out-of-memory problem. I attach my code here.

import dask.dataframe as dd

# norm_x and norm_y are Pandas DataFrames that are already loaded in memory
ddf_x = dd.from_pandas(norm_x, npartitions=500)
ddf_y = dd.from_pandas(norm_y, npartitions=500)

Could you please suggest a better way to use Dask to overcome my current problem?

Please support me. If the crash happens again, my account for accessing the high-performance computers will be disabled, and I really need this account to run my experiments.

Thank you very much.
Regards,
Hieu

I attach here an image that shows the memory usage of my code on the high-performance computers; the white part is where the Out-of-Memory Killer triggers.

[Image: jupiter3_outmem24dec22]

Hi @hieutv85, welcome to the forum!

First, the code snippet you provided is a little too short to understand what you are trying to do with Pandas and what your current Dask code looks like. Could you detail this a bit?

If you do ddf_x = dd.from_pandas(norm_x, npartitions=500), it means you first read the DataFrame with Pandas before converting it to Dask, so you load all your input data into memory; this is not what you want.

Instead, you should use Dask directly to read your input data, then apply transformations and reductions to it before either saving the result back to disk or converting it to Pandas. But again, we need some more code to be able to help you better.
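For example, a minimal sketch of this pattern (the file path, format, and column names here are only assumptions, since I don't know your actual data layout):

import dask.dataframe as dd

# Read the input lazily with Dask instead of loading it all with Pandas,
# and only the columns you actually need
ddf = dd.read_parquet("timeseries/*.parquet", columns=["t", "x1", "x2"])

# Apply transformations lazily, partition by partition
ddf = ddf[ddf["x1"].notnull()]
result = ddf.groupby("t").mean()  # some reduction that shrinks the data

# Only now trigger the computation, either writing back to disk
# or converting the (now much smaller) result to Pandas
result.to_parquet("features/")  # or: small_df = result.compute()

This way the full dataset never has to fit in memory at once.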

See [link] and also [link] for some good tips to get started.

Hi @guillaumeeb

Thank you very much for your support. I have already read the two linked documents; they are helpful.

My code is long and includes several parts: reading files into a DataFrame, preprocessing, removing unused data, and several steps that convert the data into a new form. So I am not sure which part I should provide.

Regards,

What we need to help you is a minimal version of your code that reproduces the issue.

You need to at least identify where it fails; we can then help you with the why and the fix.

If you cannot do that, please at least provide some code showing how you read the data, a representative processing step, and the final step where you launch the computation or try to write a result.
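Something like this skeleton is usually enough (the file names and steps below are made up, just to show the level of detail that helps):

import dask.dataframe as dd

# 1. How the data is read
ddf = dd.read_csv("series_*.csv")

# 2. One representative processing step
ddf = ddf[ddf["value"] > 0]

# 3. The step where the crash happens, e.g. the final computation
ddf.to_parquet("output/")  # or ddf.compute(), or the call into your feature extraction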

Hi @guillaumeeb,

Could you please share your email address with me?

My code is related to a publication that I have not published yet, so it would be much better if I could provide it privately. Please understand.

Thank you very much. I appreciate your support and your fast replies.

Couldn’t you try to extract some key parts of the code? I won’t be able to go through all of your code to analyze it; I would need a minimal reproducer to be able to help.

You can send me a private message through this forum, but if the code is too complicated, I won’t be able to help.

Thanks @guillaumeeb. I have sent you a message with the key part of the code.

So after a private discussion with @hieutv85, it appears that the issue comes from his use of the giotto-tda library. This library is not Dask compatible, in the sense that it cannot take advantage of a Dask DataFrame or Dask Array, and it does not implement any partial_fit method.

Instead, it relies on the check_array method from scikit-learn. This method, if given a Dask Array, will just convert it to a NumPy one, loading all the data into a single server's memory. So it is pointless to try to use Dask here.
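To illustrate (a small sketch with a toy array; this is not giotto-tda's exact call path, but it shows the coercion that check_array performs):

import dask.array as da
from sklearn.utils import check_array

# A lazy Dask array: nothing is materialized in memory yet
x = da.ones((1_000_000, 8), chunks=(100_000, 8))

# check_array coerces its input to a NumPy ndarray, which forces
# every chunk to be computed and held in RAM at the same time
arr = check_array(x)
print(type(arr))  # <class 'numpy.ndarray'>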

If giotto-tda is a must-have, then the optimization should be done elsewhere: reduce the size of the input dataset, take a sample, or work on just a part of it. Also verify that nothing useless is kept in memory, clean up the code a bit and free Python objects that are no longer needed, and avoid multiplying objects. Finally, try to read only the columns that are really needed into the Pandas DataFrame.
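A few of these ideas in code (only a sketch; the file name, column names, and dtypes are placeholders):

import gc
import pandas as pd

# Read only the columns that are actually used, with smaller dtypes
df = pd.read_csv(
    "series_01.csv",
    usecols=["t", "x1", "x2"],
    dtype={"x1": "float32", "x2": "float32"},
)

# Validate the pipeline on a slice or a sample before the full run
sample = df.iloc[:100_000]

# Free intermediates as soon as they are no longer needed
del df
gc.collect()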

Thank you very much for your support, @guillaumeeb.
