Dask-distributed RDataFrame on a SlurmCluster

Dear experts,

While working with the ROOT framework, we attempted to use a distributed ROOT.RDF.Experimental.Distributed.Dask.RDataFrame on a Slurm cluster and ran into some issues.
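For context, our setup looks roughly like this (a minimal sketch; the cluster parameters, tree name, and file name below are placeholders, not our exact configuration):

```python
import ROOT
from dask_jobqueue import SLURMCluster
from distributed import Client

# Start a small Slurm-backed Dask cluster (resource values are placeholders)
cluster = SLURMCluster(cores=1, memory="2GB", walltime="00:30:00")
cluster.scale(jobs=2)
client = Client(cluster)

# Build a distributed RDataFrame that offloads work to the Dask workers
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
df = RDataFrame("mytree", "myfile.root", daskclient=client)

# A trivial action to trigger the distributed computation
print(df.Count().GetValue())
```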

This issue was initially reported on the ROOT forum, but it may actually be related to the SlurmCluster in dask-distributed rather than to RDataFrame itself.

I would like to kindly ask for your assistance.

Best,
Jindrich

Hi @Jindrich, welcome to Dask community!

Interesting use case. However, I'm not sure I understand everything correctly. Is the only thing bothering you the final stack trace:

2024-08-12 12:48:03,684 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Scheduler->Client local=tcp://18.4.134.165:6069 remote=tcp://18.4.134.165:31608>
Traceback (most recent call last):
  File "/work/submit/lavezzo/miniforge3/envs/rootdf3/lib/python3.11/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
             ^^^^^^^^^^
  File "/work/submit/lavezzo/miniforge3/envs/rootdf3/lib/python3.11/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "/work/submit/lavezzo/miniforge3/envs/rootdf3/lib/python3.11/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError

?

I mean, is your workflow ending correctly? If so, this is just a bit of annoying output from a cluster that did not shut down quite cleanly, and it can be ignored.
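If you want the shutdown to be a bit more orderly, one thing to try (no guarantee it silences the message, but it closes the client before the cluster) is to let context managers handle the teardown:

```python
from dask_jobqueue import SLURMCluster
from distributed import Client

# Context managers exit in reverse order, so the client (and its connection
# to the scheduler) is closed before the cluster is torn down.
with SLURMCluster(cores=1, memory="2GB") as cluster, Client(cluster) as client:
    ...  # run the distributed RDataFrame workflow here
```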

Thank you for your reply.
Yes, the only thing that is bothering me is the final traceback.

I believe my workflow is ending correctly, but I can double-check it.
OK, I will ignore the traceback then.

Best,
Jindrich

OK, so this stack trace is indeed a problem identified in dask-jobqueue; it is not clear where it comes from, but if it does not affect your computation, you can ignore it.
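If the extra output is distracting, and assuming it all goes through the distributed.batched logger (as the first line of the output suggests), you can simply raise that logger's level:

```python
import logging

# Hide the INFO-level "Batched Comm Closed" report and its attached traceback;
# warnings and errors from distributed will still be shown.
logging.getLogger("distributed.batched").setLevel(logging.WARNING)
```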