Hi, I’m new to Dask ^^.
I wrote some code with dask and dask-dataframe, starting with dask_client = Client(n_workers=4, memory_limit='12GB').
It all works fine on my PC. However, when I run the same code in a Docker container on the same PC, it crashes due to memory (Worker process 38 was killed by signal 9)…
It’s the same code. What am I doing wrong?
Hi @koralbaron, welcome to Dask Discourse!
Well, as a first guess, I would say it must come either from how you launch Docker, your docker general configuration, or the container you use for launching the code.
But these are just wild guesses; we would need a Minimal Reproducible Example to be able to really help. At the very least, how are you launching Dask with Docker?
Right, the first thing I would look at is how much CPU and memory your docker config allows. Also worth asking: why do you want to run this in a docker container when the local cluster alone already worked well?
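For anyone hitting the same "killed by signal 9" message: one rough way to check what the container is actually allowed to use is to read the cgroup memory limit from inside it. The snippet below is only a sketch, assuming a Linux container with the usual cgroup v1 or v2 file locations. The helper names (parse_limit, container_memory_limit) are made up for illustration and are not part of Dask or Docker. It compares the limit with what Client(n_workers=4, memory_limit='12GB') asks for, given that memory_limit applies per worker.

```python
from pathlib import Path

# Usual cgroup locations on Linux; which one exists depends on the host setup.
CGROUP_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)


def parse_limit(raw):
    """Return the limit in bytes, or None when cgroup v2 reports 'max' (no limit)."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def container_memory_limit():
    """Return the first cgroup memory limit found, or None if there is none."""
    for path in CGROUP_FILES:
        if path.exists():
            return parse_limit(path.read_text())
    return None


limit = container_memory_limit()
# n_workers=4 with memory_limit='12GB' per worker (Dask reads GB as 10**9 bytes)
requested = 4 * 12 * 10**9

if limit is None:
    print("no cgroup memory limit found")
else:
    print(f"container limit: {limit / 1e9:.1f} GB, "
          f"requested by workers: {requested / 1e9:.0f} GB")
```

If the reported limit is well below the requested total, the workers can be killed by the kernel before Dask's own memory management steps in, which would match the symptom described above.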
Ty. I’ve changed the memory my docker config allows and now it is running.
It looks like lots of memory is needed during the saving of the df to parquet because of the compute() function. Is there a way to make compute() use less ram even if it takes longer to finish?
Anyway, you solved my main issue with the container, so we can close this topic. Ty
Without seeing your code, it is hard to comment. However, saving to parquet does not normally involve a call to compute() at all, since the dataframe has a .to_parquet() method already that doesn’t need it.
I meant that inside the implementation of the to_parquet method there is a call to compute(),
and I guess that because there is a lot of data to compute, it uses lots of RAM. I just wonder if there is a way to limit RAM usage during the .to_parquet() call.
That call is necessary in order for anything to happen at all. So unless you can make smaller partitions or a smaller number of parallel tasks, memory use will be what it is.
that’s exactly what I’m looking for
how can I do that? how to control the number of parallel tasks?
Set the number of threads per worker with threads_per_worker= in your call to Client().
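As a sketch of what that could look like with the settings from the first post: the wrapper name make_client is made up, and threads_per_worker=1 is just one possible choice, not a recommendation from this thread.

```python
from dask.distributed import Client


def make_client(n_workers=4, threads_per_worker=1, memory_limit="12GB",
                processes=True):
    """Start a local cluster in which each worker runs few tasks at once.

    With threads_per_worker=1, at most n_workers tasks run in parallel,
    so fewer partitions are held in memory at the same time.
    memory_limit applies per worker, not to the cluster as a whole.
    """
    return Client(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        memory_limit=memory_limit,
        processes=processes,
    )
```

In a script, call make_client() under an if __name__ == "__main__": guard, since process-based workers re-import the main module. After that, ddf.to_parquet(...) runs on this client as before. Fewer threads usually means lower peak memory but a longer runtime.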
that does the trick ty very much