HTCondorCluster failed to start - ConnectionRefusedError: [Errno 111] Connection refused

Earl_Russell_Almazan · September 27, 2024, 1:06am

I'm running on UChicago and was waiting for compute() to finish on an HTCondorCluster set up like this:

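import dask
from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

# loop, function and parameters are defined elsewhere in my script
cluster = HTCondorCluster(log_directory="path/to/log/", cores=5, memory="20GB", disk="5GB")
output = []
for i in loop:
    output.append(dask.delayed(function)(parameters[i]))
cluster.scale(jobs=len(output))
client = Client(cluster)
dask.compute(*output)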

I’m seeing a few jobs get submitted, but they appear to fail, and after a few minutes I get this error:

/cvmfs/sft.cern.ch/lcg/releases/LCG_105/distributed/2023.7.1/x86_64-el9-gcc13-opt/lib/python3.9/site-packages/distributed/node.py:182: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 24469 instead
  warnings.warn(
ConnectionRefusedError: [Errno 111] Connection refused

What could be the cause of this / what steps can I take to diagnose what is happening?

guillaumeeb · September 27, 2024, 1:50pm

Hi @Earl_Russell_Almazan, welcome to the Dask community!

The best first step is to look at the stdout/stderr files that your job submission system writes for the Dask worker jobs. You might also find useful debugging methods here:
https://jobqueue.dask.org/en/latest/debug.html
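
As a rough sketch of that first step, the snippet below prints the last lines of each worker log file. It assumes the worker stdout/stderr files end up in the `log_directory` passed to HTCondorCluster with `.out` and `.err` extensions; the helper name `tail_worker_logs` and the file patterns are illustrative, so adjust them to whatever your site actually writes.

```python
from collections import deque
from pathlib import Path


def tail_worker_logs(log_dir, patterns=("*.err", "*.out"), n_lines=20):
    """Return {file name: last n_lines lines} for matching files in log_dir.

    The patterns are an assumption about how the batch system names its
    stdout/stderr files; change them if your files are named differently.
    """
    tails = {}
    for pattern in patterns:
        for path in sorted(Path(log_dir).glob(pattern)):
            with path.open(errors="replace") as handle:
                # A deque with maxlen keeps only the final n_lines lines.
                last_lines = deque(handle, maxlen=n_lines)
            tails[path.name] = [line.rstrip("\n") for line in last_lines]
    return tails


if __name__ == "__main__":
    # "path/to/log/" is the placeholder used in the original post.
    for name, lines in tail_worker_logs("path/to/log/").items():
        print(f"==> {name} <==")
        print("\n".join(lines))
```

If the directory is empty, the jobs probably never started, and the HTCondor side (for example `condor_q`) is the next place to look.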

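The "Port 8787 is already in use" message seems to be just a warning: the default dashboard port is already taken, so the HTTP server is hosted on another port instead. It is separate from the ConnectionRefusedError.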
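
To tell the two situations apart, here is a small standard-library sketch. A port that cannot be bound is "already in use" (the warning case), while a connection attempt that fails means nothing reachable is listening at that address (what Errno 111 reports). The function names and the default host are illustrative assumptions, not part of Dask.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use": something else holds the port.
            return False
    return True


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8787 is the port named in the warning from the original post.
    print("port 8787 free on this machine:", port_is_free(8787))
```

Running `can_connect` from a worker node against the scheduler's host and port would show whether the workers can reach the scheduler at all.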
