Dear experts,
While working with the ROOT framework, we tried to use a distributed ROOT.RDF.Experimental.Distributed.Dask.RDataFrame on a Slurm cluster and ran into some issues.
The problem was initially reported on the ROOT forum, but it may actually be related to the SLURMCluster in dask-jobqueue rather than to RDataFrame itself.
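For context, this is roughly the setup we use. The sketch below is minimal and illustrative only: the queue name, resources, tree and file names are placeholders, not our actual configuration.

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster
import ROOT

# Start a Dask cluster whose workers are submitted as Slurm jobs
# (queue, cores, memory and walltime are placeholder values).
cluster = SLURMCluster(
    queue="submit",
    cores=2,
    memory="4GB",
    walltime="01:00:00",
)
cluster.scale(jobs=4)      # request four Slurm jobs worth of workers
client = Client(cluster)

# Build a distributed RDataFrame backed by the Dask client
# ("mytree" and "myfile.root" are placeholders).
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
df = RDataFrame("mytree", "myfile.root", daskclient=client)

n = df.Count()             # lazy action
print(n.GetValue())        # triggers the distributed computation
```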
I would like to kindly ask for your assistance.
Best,
Jindrich
Hi @Jindrich, welcome to the Dask community!
Interesting use case. However, I'm not sure I understand everything correctly. Is the only thing bothering you the final stack trace:
```
2024-08-12 12:48:03,684 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Scheduler->Client local=tcp://18.4.134.165:6069 remote=tcp://18.4.134.165:31608>
Traceback (most recent call last):
  File "/work/submit/lavezzo/miniforge3/envs/rootdf3/lib/python3.11/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
             ^^^^^^^^^^
  File "/work/submit/lavezzo/miniforge3/envs/rootdf3/lib/python3.11/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "/work/submit/lavezzo/miniforge3/envs/rootdf3/lib/python3.11/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
```
?
I mean, is your workflow ending correctly? If so, this is just a bit of annoying output from a cluster that did not shut down quite cleanly, and it can be ignored.
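If the message only shows up at teardown, shutting the client and cluster down explicitly (or via context managers) often avoids it. A minimal sketch, assuming a SLURMCluster setup like the one above:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# The context managers close the client before the cluster is torn down,
# so the scheduler->client comm is not cut off mid-send at interpreter exit.
with SLURMCluster(cores=2, memory="4GB") as cluster:
    cluster.scale(jobs=2)
    with Client(cluster) as client:
        pass  # run the distributed RDataFrame workflow here
```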
Thank you for your reply.
Yes, the only thing that is bothering me is the final traceback.
I believe my workflow is ending correctly, but I can double-check it.
Ok, I will ignore the traceback then.
Best,
Jindrich
Ok, so this stack trace is indeed a problem that has been identified in dask-jobqueue. I'm not sure exactly where it comes from, but if it does not affect your computation, you can ignore it.
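If the output is distracting, one option is to raise the threshold of the logger it comes from. Judging by the distributed.batched - INFO prefix in the output above, something like the following should hide it; this is a sketch that silences the message, not a fix for the underlying issue:

```python
import logging

# The "Batched Comm Closed" message (with its attached traceback) appears to be
# logged at INFO level by the distributed.batched logger, so raising its
# threshold to WARNING hides it while keeping warnings and errors visible.
logging.getLogger("distributed.batched").setLevel(logging.WARNING)
```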