Setup of Dask on HPC

Hello all,

I installed Dask on an HPC system. There was a dependency on mpi4py, which I installed with pip using the Intel compiler and Intel MPI. I load the same compiler and MPI modules when running a batch job with Dask-MPI. I’m trying to run the following code:

import time

start_time = time.perf_counter()
from dask.distributed import Client
from dask_mpi import initialize
import dask.array as da
from dask_ml.linear_model import LinearRegression
from dask_ml.model_selection import train_test_split

if __name__=="__main__":
    initialize()
    client = Client()
    
    X = da.linspace(-10, 10, num=1000000).reshape(-1, 1)
    Y = X ** 2

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
end_time = time.perf_counter()
print(end_time-start_time)
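
As a side note, to check that the mpi4py in this environment really picks up the Intel MPI module I load for the job, I run a small one-off script under the same mpirun (just a sanity-check sketch, not part of run_batch.py):

# check_mpi.py: hypothetical one-off check, not part of run_batch.py.
# Prints which MPI library mpi4py was built against and confirms all ranks start.
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    # For an Intel MPI build this string should mention "Intel(R) MPI Library"
    print(MPI.Get_library_version())
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")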

I am using the following batch job submission script:

#!/bin/bash
#SBATCH --output="%x.%j.out"
#SBATCH --export=ALL
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --mem=249208M
#SBATCH -t 48:00:00

module load cpu/0.15.4
module load intel
module load intel-mpi
module load intel-mkl

source ~/.bashrc
conda activate ml

mpirun -np ${SLURM_NTASKS} python run_batch.py

I use a conda environment “ml” for all machine-learning packages.
The “compute” partition means I’m using the full node for the computation.
With the above configuration, I get the following error:

The following have been reloaded with a version change:
  1) cpu/0.17.3b => cpu/0.15.4

2024-11-05 18:46:49,306 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2024-11-05 18:46:49,329 - distributed.scheduler - INFO - State start
2024-11-05 18:46:49,337 - distributed.scheduler - INFO -   Scheduler at: tcp://198.202.102.253:39035
2024-11-05 18:46:49,337 - distributed.scheduler - INFO -   dashboard at:  http://198.202.102.253:8787/status
2024-11-05 18:46:49,337 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-11-05 18:46:49,349 - distributed.scheduler - INFO - Receive client connection: Client-5ee34b82-9be9-11ef-928c-1c34da62ac20
2024-11-05 18:46:49,352 - distributed.core - INFO - Starting established connection to tcp://198.202.102.253:52660
2024-11-05 18:46:49,888 - distributed.worker - INFO -       Start worker at: tcp://198.202.102.253:33411
2024-11-05 18:46:49,889 - distributed.worker - INFO -          Listening to: tcp://198.202.102.253:33411
2024-11-05 18:46:49,889 - distributed.worker - INFO -           Worker name:                         58
2024-11-05 18:46:49,889 - distributed.worker - INFO -          dashboard at:      198.202.102.253:36563
2024-11-05 18:46:49,889 - distributed.worker - INFO - Waiting to connect to: tcp://198.202.102.253:39035
2024-11-05 18:46:49,889 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:49,889 - distributed.worker - INFO -               Threads:                          1
2024-11-05 18:46:49,889 - distributed.worker - INFO -                Memory:                 121.68 GiB
2024-11-05 18:46:49,889 - distributed.worker - INFO -       Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-b1f4i5x0
2024-11-05 18:46:49,889 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:49,894 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://198.202.102.253:33411', name: 58, status: init, memory: 0, processing: 0>
2024-11-05 18:46:49,895 - distributed.scheduler - INFO - Starting worker compute stream, tcp://198.202.102.253:33411
2024-11-05 18:46:49,895 - distributed.core - INFO - Starting established connection to tcp://198.202.102.253:52668
2024-11-05 18:46:49,895 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-11-05 18:46:49,895 - distributed.worker - INFO -         Registered to: tcp://198.202.102.253:39035
2024-11-05 18:46:49,895 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:49,896 - distributed.core - INFO - Starting established connection to tcp://198.202.102.253:39035
2024-11-05 18:46:50,000 - distributed.worker - INFO -       Start worker at: tcp://198.202.102.253:38387
2024-11-05 18:46:50,000 - distributed.worker - INFO -          Listening to: tcp://198.202.102.253:38387
2024-11-05 18:46:50,000 - distributed.worker - INFO -           Worker name:                         18
2024-11-05 18:46:50,000 - distributed.worker - INFO -          dashboard at:      198.202.102.253:39177
2024-11-05 18:46:50,000 - distributed.worker - INFO - Waiting to connect to: tcp://198.202.102.253:39035
2024-11-05 18:46:50,000 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,000 - distributed.worker - INFO -               Threads:                          1
2024-11-05 18:46:50,000 - distributed.worker - INFO -                Memory:                 121.68 GiB
2024-11-05 18:46:50,000 - distributed.worker - INFO -       Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-k_cx7yu7
2024-11-05 18:46:50,000 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,005 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://198.202.102.253:38387', name: 18, status: init, memory: 0, processing: 0>
2024-11-05 18:46:50,006 - distributed.scheduler - INFO - Starting worker compute stream, tcp://198.202.102.253:38387
2024-11-05 18:46:50,006 - distributed.core - INFO - Starting established connection to tcp://198.202.102.253:52680
2024-11-05 18:46:50,006 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-11-05 18:46:50,006 - distributed.worker - INFO -         Registered to: tcp://198.202.102.253:39035
2024-11-05 18:46:50,006 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,006 - distributed.core - INFO - Starting established connection to tcp://198.202.102.253:39035
2024-11-05 18:46:50,019 - distributed.worker - INFO -       Start worker at: tcp://198.202.102.253:44205
2024-11-05 18:46:50,019 - distributed.worker - INFO -          Listening to: tcp://198.202.102.253:44205
2024-11-05 18:46:50,019 - distributed.worker - INFO -       Start worker at: tcp://198.202.102.253:37685
2024-11-05 18:46:50,019 - distributed.worker - INFO -          Listening to: tcp://198.202.102.253:37685
2024-11-05 18:46:50,019 - distributed.worker - INFO -       Start worker at: tcp://198.202.102.253:39047
2024-11-05 18:46:50,019 - distributed.worker - INFO -          Listening to: tcp://198.202.102.253:39047
2024-11-05 18:46:50,019 - distributed.worker - INFO -           Worker name:                          3
2024-11-05 18:46:50,019 - distributed.worker - INFO -           Worker name:                         10
2024-11-05 18:46:50,019 - distributed.worker - INFO -          dashboard at:      198.202.102.253:34053
2024-11-05 18:46:50,019 - distributed.worker - INFO - Waiting to connect to: tcp://198.202.102.253:39035
2024-11-05 18:46:50,019 - distributed.worker - INFO -       Start worker at: tcp://198.202.102.253:43123
2024-11-05 18:46:50,019 - distributed.worker - INFO -          Listening to: tcp://198.202.102.253:43123
2024-11-05 18:46:50,019 - distributed.worker - INFO -           Worker name:                          2
2024-11-05 18:46:50,019 - distributed.worker - INFO -          dashboard at:      198.202.102.253:43253
2024-11-05 18:46:50,019 - distributed.worker - INFO -          dashboard at:      198.202.102.253:45053
2024-11-05 18:46:50,019 - distributed.worker - INFO - Waiting to connect to: tcp://198.202.102.253:39035
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO -               Threads:                          1
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO -               Threads:                          1
2024-11-05 18:46:50,019 - distributed.worker - INFO -                Memory:                 121.68 GiB
2024-11-05 18:46:50,019 - distributed.worker - INFO - Waiting to connect to: tcp://198.202.102.253:39035
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO -                Memory:                 121.68 GiB
2024-11-05 18:46:50,019 - distributed.worker - INFO -       Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-2pgbm0wv
2024-11-05 18:46:50,019 - distributed.worker - INFO -           Worker name:                         19
2024-11-05 18:46:50,019 - distributed.worker - INFO -          dashboard at:      198.202.102.253:46905
2024-11-05 18:46:50,019 - distributed.worker - INFO -               Threads:                          1
2024-11-05 18:46:50,019 - distributed.worker - INFO -                Memory:                 121.68 GiB
2024-11-05 18:46:50,019 - distributed.worker - INFO -       Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-crus8hmb
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO - Waiting to connect to: tcp://198.202.102.253:39035
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO -               Threads:                          1
2024-11-05 18:46:50,019 - distributed.worker - INFO -       Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-vnamu75r
**And a lot more of these worker messages!**
2024-11-05 18:46:50,719 - distributed.core - INFO - Starting established connection to tcp://198.202.102.253:39035
15.355942100053653
2024-11-05 18:46:50,968 - distributed.scheduler - INFO - Receive client connection: Client-5fd67b4c-9be9-11ef-928c-1c34da62ac20
2024-11-05 18:46:50,968 - distributed.core - INFO - Starting established connection to tcp://198.202.102.253:53202
2024-11-05 18:46:50,970 - distributed.worker - INFO - Run out-of-band function 'stop'
2024-11-05 18:46:50,970 - distributed.scheduler - INFO - Scheduler closing due to unknown reason...
2024-11-05 18:46:50,970 - distributed.scheduler - INFO - Scheduler closing all comms
2024-11-05 18:46:50,972 - distributed.worker - INFO - Stopping worker at tcp://198.202.102.253:33411. Reason: scheduler-close
2024-11-05 18:46:50,972 - distributed.worker - INFO - Stopping worker at tcp://198.202.102.253:38387. Reason: scheduler-close
2024-11-05 18:46:50,972 - distributed.worker - INFO - Stopping worker at tcp://198.202.102.253:44205. Reason: scheduler-close
2024-11-05 18:46:50,972 - distributed.worker - INFO - Stopping worker at tcp://198.202.102.253:43123. Reason: scheduler-close
**And a lot more of these scheduler-close messages!**
2024-11-05 18:46:50,981 - distributed.core - INFO - Connection to tcp://198.202.102.253:52978 has been closed.
2024-11-05 18:46:50,981 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://198.202.102.253:45663', name: 12, status: running, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.9816923')
2024-11-05 18:46:50,982 - distributed.core - INFO - Connection to tcp://198.202.102.253:53084 has been closed.
2024-11-05 18:46:50,982 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://198.202.102.253:46443', name: 52, status: running, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.98224')
**And a lot more of these Remove worker messages!**
2024-11-05 18:46:50,978 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp://198.202.102.253:52978 remote=tcp://198.202.102.253:39035>
Traceback (most recent call last):
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 297, in write
    raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
             ^^^^^^^^^^
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
            ^^^^^^^^^^^^^^^
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 307, in write
    convert_stream_closed_error(self, e)
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp://198.202.102.253:52978 remote=tcp://198.202.102.253:39035>: Stream is closed
2024-11-05 18:46:50,994 - distributed.core - INFO - Connection to tcp://198.202.102.253:52806 has been closed.
2024-11-05 18:46:50,994 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://198.202.102.253:40611', name: 28, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.9946446')
2024-11-05 18:46:50,994 - distributed.core - INFO - Connection to tcp://198.202.102.253:52872 has been closed.
2024-11-05 18:46:50,994 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://198.202.102.253:42099', name: 8, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.994934')
2024-11-05 18:46:50,995 - distributed.core - INFO - Connection to tcp://198.202.102.253:52742 has been closed.
2024-11-05 18:46:50,995 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://198.202.102.253:42153', name: 36, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.995262')
2024-11-05 18:46:50,995 - distributed.core - INFO - Connection to tcp://198.202.102.253:52846 has been closed.
2024-11-05 18:46:50,995 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://198.202.102.253:45159', name: 7, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.9955466')
**And a lot more of these Remove worker messages!**
2024-11-05 18:46:51,001 - distributed.scheduler - INFO - Lost all workers
2024-11-05 18:46:51,154 - distributed.client - ERROR - 
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/core.py", line 342, in connect
    comm = await wait_for(
           ^^^^^^^^^^^^^^^
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/utils.py", line 1956, in wait_for
    return await fut
           ^^^^^^^^^
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 559, in connect
    convert_stream_closed_error(self, e)
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x15523055ab70>: ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/client.py", line 1330, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/client.py", line 1360, in _ensure_connected
    comm = await connect(
           ^^^^^^^^^^^^^^
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/core.py", line 366, in connect
    await asyncio.sleep(backoff)
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/asyncio/tasks.py", line 665, in sleep
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError
2024-11-05 18:46:51,255 - distributed.core - INFO - Received 'close-stream' from tcp://198.202.102.253:39035; closing.
2024-11-05 18:46:51,256 - distributed.core - INFO - Received 'close-stream' from tcp://198.202.102.253:39035; closing.
2024-11-05 18:46:51,257 - distributed.core - INFO - Received 'close-stream' from tcp://198.202.102.253:39035; closing.
2024-11-05 18:46:51,262 - distributed.core - INFO - Received 'close-stream' from tcp://198.202.102.253:39035; closing.
**And a lot more of these messages!**

When I used a different configuration for the job, with only 4 cores (and 7788M of memory) on a compute node, I ran into the following problem:

The following have been reloaded with a version change:
  1) cpu/0.17.3b => cpu/0.15.4

2024-11-05 14:11:04,547 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2024-11-05 14:11:04,594 - distributed.scheduler - INFO - State start
2024-11-05 14:11:04,597 - distributed.scheduler - INFO -   Scheduler at: tcp://198.202.101.50:42251
2024-11-05 14:11:04,598 - distributed.scheduler - INFO -   dashboard at:  http://198.202.101.50:8787/status
2024-11-05 14:11:04,598 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-11-05 14:11:04,628 - distributed.scheduler - INFO - Receive client connection: Client-d97700b4-9bc2-11ef-b065-1c34da75846a
2024-11-05 14:11:04,631 - distributed.core - INFO - Starting established connection to tcp://198.202.101.50:45292
2024-11-05 14:11:04,962 - distributed.worker - INFO -       Start worker at: tcp://198.202.101.50:34849
2024-11-05 14:11:04,962 - distributed.worker - INFO -          Listening to: tcp://198.202.101.50:34849
2024-11-05 14:11:04,962 - distributed.worker - INFO -           Worker name:                          3
2024-11-05 14:11:04,962 - distributed.worker - INFO -          dashboard at:       198.202.101.50:33095
2024-11-05 14:11:04,962 - distributed.worker - INFO - Waiting to connect to: tcp://198.202.101.50:42251
2024-11-05 14:11:04,962 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,962 - distributed.worker - INFO -               Threads:                          1
2024-11-05 14:11:04,962 - distributed.worker - INFO -                Memory:                 243.34 MiB
2024-11-05 14:11:04,962 - distributed.worker - INFO -       Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-89qs3ik2
2024-11-05 14:11:04,963 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,978 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://198.202.101.50:34849', name: 3, status: init, memory: 0, processing: 0>
2024-11-05 14:11:04,979 - distributed.scheduler - INFO - Starting worker compute stream, tcp://198.202.101.50:34849
2024-11-05 14:11:04,979 - distributed.core - INFO - Starting established connection to tcp://198.202.101.50:45304
2024-11-05 14:11:04,979 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-11-05 14:11:04,980 - distributed.worker - INFO -         Registered to: tcp://198.202.101.50:42251
2024-11-05 14:11:04,980 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,980 - distributed.core - INFO - Starting established connection to tcp://198.202.101.50:42251
2024-11-05 14:11:04,982 - distributed.worker - INFO -       Start worker at: tcp://198.202.101.50:42555
2024-11-05 14:11:04,982 - distributed.worker - INFO -          Listening to: tcp://198.202.101.50:42555
2024-11-05 14:11:04,983 - distributed.worker - INFO -           Worker name:                          2
2024-11-05 14:11:04,983 - distributed.worker - INFO -          dashboard at:       198.202.101.50:38747
2024-11-05 14:11:04,983 - distributed.worker - INFO - Waiting to connect to: tcp://198.202.101.50:42251
2024-11-05 14:11:04,983 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,983 - distributed.worker - INFO -               Threads:                          1
2024-11-05 14:11:04,983 - distributed.worker - INFO -                Memory:                 243.34 MiB
2024-11-05 14:11:04,983 - distributed.worker - INFO -       Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-f0g5co33
2024-11-05 14:11:04,983 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,988 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://198.202.101.50:42555', name: 2, status: init, memory: 0, processing: 0>
2024-11-05 14:11:04,988 - distributed.scheduler - INFO - Starting worker compute stream, tcp://198.202.101.50:42555
2024-11-05 14:11:04,988 - distributed.core - INFO - Starting established connection to tcp://198.202.101.50:45320
2024-11-05 14:11:04,988 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-11-05 14:11:04,989 - distributed.worker - INFO -         Registered to: tcp://198.202.101.50:42251
2024-11-05 14:11:04,989 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,989 - distributed.core - INFO - Starting established connection to tcp://198.202.101.50:42251
2024-11-05 14:11:05,159 - distributed.worker.memory - WARNING - Worker is at 162% memory usage. Pausing worker.  Process memory: 395.18 MiB -- Worker memory limit: 243.34 MiB
2024-11-05 14:11:05,159 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 395.18 MiB -- Worker memory limit: 243.34 MiB
2024-11-05 14:11:05,188 - distributed.worker.memory - WARNING - Worker is at 167% memory usage. Pausing worker.  Process memory: 408.41 MiB -- Worker memory limit: 243.34 MiB
2024-11-05 14:11:05,265 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 415.61 MiB -- Worker memory limit: 243.34 MiB
2024-11-05 14:11:15,190 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 395.18 MiB -- Worker memory limit: 243.34 MiB
**The same message continued for a while before I killed the job!**
slurmstepd: error: *** STEP 34887673.0 ON exp-13-53 CANCELLED AT 2024-11-05T15:11:49 ***
slurmstepd: error: *** JOB 34887673 ON exp-13-53 CANCELLED AT 2024-11-05T15:11:49 ***

I am not sure what is going wrong. Any help in resolving this matter would be appreciated.

Thank you.

It looks to me like your code is running as expected. Before the error I see the scheduler closing down the cluster.

2024-11-05 18:46:50,970 - distributed.scheduler - INFO - Scheduler closing all comms

This suggests to me that your code is working fine, but then some exception happens during shutdown. This shouldn’t affect your code; it’s just annoying.
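
If the shutdown traceback bothers you, one thing that sometimes makes the teardown quieter (just a sketch, no guarantee it removes the message) is closing the client explicitly before the script exits:

# Sketch: close the client explicitly before Dask-MPI tears the cluster down;
# this sometimes reduces the CommClosedError noise during shutdown.
from dask.distributed import Client
from dask_mpi import initialize

initialize()
client = Client()

# ... your computation ...

client.close()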

Are you seeing specific issues with your code running?

If I run the code without the Dask client, it runs fine with no errors. I think it is some parallelization issue.
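
To be concrete, by “without the Dask client” I mean roughly the same workload with the initialize() / Client() lines removed, so Dask falls back to its default local scheduler (a rough sketch, not the exact file I ran):

# Rough sketch of the "without the Dask client" variant: same workload,
# no dask_mpi.initialize() and no Client(), so Dask uses its default
# local scheduler.
import dask.array as da
from dask_ml.linear_model import LinearRegression
from dask_ml.model_selection import train_test_split

X = da.linspace(-10, 10, num=1000000).reshape(-1, 1)
Y = X ** 2
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)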

Hi @Sree_Harsha,

In your first attempt with a full node, the only thing that looks odd to me is that there are only 2 milliseconds between the second client connection and the shutdown messages that follow it, which is itself a bit strange.

In how much time does it complete? Are you sure the code is not actually running fine in the first attempt?

In your second attempt, there is a problem with the Worker memory limit: it is really too low. How did you configure the cluster?
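
For reference, you can usually set these limits explicitly when the Dask-MPI cluster is created instead of relying on automatic detection. A minimal sketch, assuming dask_mpi.initialize() accepts the nthreads and memory_limit keyword arguments (the 243.34 MiB limit in your log is far smaller than 7788M shared among your workers, so the automatic detection does not seem to be seeing your actual allocation):

# Sketch: give each Dask-MPI worker an explicit memory budget instead of
# relying on automatic detection. The exact keyword arguments are an
# assumption; check the dask_mpi documentation for your installed version.
from dask.distributed import Client
from dask_mpi import initialize

initialize(
    nthreads=1,             # one thread per MPI rank / worker
    memory_limit="1.5GiB",  # roughly (memory requested from Slurm) / (workers per node)
)
client = Client()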