Dask-distributed RDataFrame on a SlurmCluster

Dear experts,

While working with the ROOT framework, we attempted to use a distributed ROOT.RDF.Experimental.Distributed.Dask.RDataFrame on a Slurm cluster and ran into some issues.
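For context, our setup looks roughly like this (a minimal sketch; the cluster parameters, tree name, and file name below are placeholders, not our exact configuration):

```python
import ROOT
from dask_jobqueue import SLURMCluster
from distributed import Client

# Start a small Slurm-backed Dask cluster (resource values are placeholders)
cluster = SLURMCluster(cores=1, memory="2GB", walltime="00:30:00")
cluster.scale(jobs=2)
client = Client(cluster)

# Build a distributed RDataFrame that offloads work to the Dask workers
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
df = RDataFrame("mytree", "myfile.root", daskclient=client)

# A trivial action to trigger the distributed computation
print(df.Count().GetValue())
```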

This issue was initially reported on the ROOT forum, but it may actually be related to the SlurmCluster in dask-distributed rather than to RDataFrame itself.

I would like to kindly ask for your assistance.

Best,
Jindrich

Hi @Jindrich, welcome to Dask community!

Interesting use case. However, I'm not sure I understand everything correctly. Is the only thing bothering you the final stack trace:

2024-08-12 12:48:03,684 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Scheduler->Client local=tcp://18.4.134.165:6069 remote=tcp://18.4.134.165:31608>
Traceback (most recent call last):
  File "/work/submit/lavezzo/miniforge3/envs/rootdf3/lib/python3.11/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
             ^^^^^^^^^^
  File "/work/submit/lavezzo/miniforge3/envs/rootdf3/lib/python3.11/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "/work/submit/lavezzo/miniforge3/envs/rootdf3/lib/python3.11/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError

?

I mean, is your workflow ending correctly? If so, this is just a bit of annoying output from a cluster that did not shut down quite cleanly, and it can be ignored.
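If you want the shutdown to be a bit more orderly, one thing to try (no guarantee it silences the message, but it closes the client before the cluster) is to let context managers handle the teardown:

```python
from dask_jobqueue import SLURMCluster
from distributed import Client

# Context managers exit in reverse order, so the client (and its connection
# to the scheduler) is closed before the cluster is torn down.
with SLURMCluster(cores=1, memory="2GB") as cluster, Client(cluster) as client:
    ...  # run the distributed RDataFrame workflow here
```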

Thank you for your reply.
Yes, the only thing that is bothering me is the final traceback.

I believe my workflow is ending correctly, but I can double-check it.
OK, I will ignore the traceback then.

Best,
Jindrich

OK, so this stack trace is indeed a problem identified in dask-jobqueue; it is not clear where it comes from, but if it does not affect your computation, you can ignore it.
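If the extra output is distracting, and assuming it all goes through the distributed.batched logger (as the first line of the output suggests), you can simply raise that logger's level:

```python
import logging

# Hide the INFO-level "Batched Comm Closed" report and its attached traceback;
# warnings and errors from distributed will still be shown.
logging.getLogger("distributed.batched").setLevel(logging.WARNING)
```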