Hello all,
I installed Dask on an HPC cluster. It needed mpi4py as a dependency, which I installed with pip using the Intel compiler and Intel MPI. I load the same compiler and MPI modules when running a batch job with Dask-MPI. I’m trying to run the following code:
import time
start_time = time.perf_counter()
from dask.distributed import Client
from dask_mpi import initialize
import dask.array as da
from dask_ml.linear_model import LinearRegression
from dask_ml.model_selection import train_test_split
if __name__ == "__main__":
    client = Client()
    X = da.linspace(-10, 10, num=1000000).reshape(-1, 1)
    Y = X ** 2
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    end_time = time.perf_counter()
I am using the following batch job submission script:
#SBATCH --output="%x.%j.out"
#SBATCH --export=ALL
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=64
#SBATCH --mem=249208M
#SBATCH -t 48:00:00
module load cpu/0.15.4
module load intel
module load intel-mpi
module load intel-mkl
source ~/.bashrc
conda activate ml
mpirun -np ${SLURM_NTASKS} python run_batch.py
I use a conda environment, “ml”, for all machine-learning packages.
Submitting to the “compute” partition means that I’m using the full node for the computation.
With the above configuration, I get the following error:
The following have been reloaded with a version change:
1) cpu/0.17.3b => cpu/0.15.4
2024-11-05 18:46:49,306 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2024-11-05 18:46:49,329 - distributed.scheduler - INFO - State start
2024-11-05 18:46:49,337 - distributed.scheduler - INFO - Scheduler at: tcp://
2024-11-05 18:46:49,337 - distributed.scheduler - INFO - dashboard at:
2024-11-05 18:46:49,337 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-11-05 18:46:49,349 - distributed.scheduler - INFO - Receive client connection: Client-5ee34b82-9be9-11ef-928c-1c34da62ac20
2024-11-05 18:46:49,352 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 18:46:49,888 - distributed.worker - INFO - Start worker at: tcp://
2024-11-05 18:46:49,889 - distributed.worker - INFO - Listening to: tcp://
2024-11-05 18:46:49,889 - distributed.worker - INFO - Worker name: 58
2024-11-05 18:46:49,889 - distributed.worker - INFO - dashboard at:
2024-11-05 18:46:49,889 - distributed.worker - INFO - Waiting to connect to: tcp://
2024-11-05 18:46:49,889 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:49,889 - distributed.worker - INFO - Threads: 1
2024-11-05 18:46:49,889 - distributed.worker - INFO - Memory: 121.68 GiB
2024-11-05 18:46:49,889 - distributed.worker - INFO - Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-b1f4i5x0
2024-11-05 18:46:49,889 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:49,894 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://', name: 58, status: init, memory: 0, processing: 0>
2024-11-05 18:46:49,895 - distributed.scheduler - INFO - Starting worker compute stream, tcp://
2024-11-05 18:46:49,895 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 18:46:49,895 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-11-05 18:46:49,895 - distributed.worker - INFO - Registered to: tcp://
2024-11-05 18:46:49,895 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:49,896 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 18:46:50,000 - distributed.worker - INFO - Start worker at: tcp://
2024-11-05 18:46:50,000 - distributed.worker - INFO - Listening to: tcp://
2024-11-05 18:46:50,000 - distributed.worker - INFO - Worker name: 18
2024-11-05 18:46:50,000 - distributed.worker - INFO - dashboard at:
2024-11-05 18:46:50,000 - distributed.worker - INFO - Waiting to connect to: tcp://
2024-11-05 18:46:50,000 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,000 - distributed.worker - INFO - Threads: 1
2024-11-05 18:46:50,000 - distributed.worker - INFO - Memory: 121.68 GiB
2024-11-05 18:46:50,000 - distributed.worker - INFO - Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-k_cx7yu7
2024-11-05 18:46:50,000 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,005 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://', name: 18, status: init, memory: 0, processing: 0>
2024-11-05 18:46:50,006 - distributed.scheduler - INFO - Starting worker compute stream, tcp://
2024-11-05 18:46:50,006 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 18:46:50,006 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-11-05 18:46:50,006 - distributed.worker - INFO - Registered to: tcp://
2024-11-05 18:46:50,006 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,006 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Start worker at: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Listening to: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Start worker at: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Listening to: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Start worker at: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Listening to: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Worker name: 3
2024-11-05 18:46:50,019 - distributed.worker - INFO - Worker name: 10
2024-11-05 18:46:50,019 - distributed.worker - INFO - dashboard at:
2024-11-05 18:46:50,019 - distributed.worker - INFO - Waiting to connect to: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Start worker at: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Listening to: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - Worker name: 2
2024-11-05 18:46:50,019 - distributed.worker - INFO - dashboard at:
2024-11-05 18:46:50,019 - distributed.worker - INFO - dashboard at:
2024-11-05 18:46:50,019 - distributed.worker - INFO - Waiting to connect to: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO - Threads: 1
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO - Threads: 1
2024-11-05 18:46:50,019 - distributed.worker - INFO - Memory: 121.68 GiB
2024-11-05 18:46:50,019 - distributed.worker - INFO - Waiting to connect to: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO - Memory: 121.68 GiB
2024-11-05 18:46:50,019 - distributed.worker - INFO - Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-2pgbm0wv
2024-11-05 18:46:50,019 - distributed.worker - INFO - Worker name: 19
2024-11-05 18:46:50,019 - distributed.worker - INFO - dashboard at:
2024-11-05 18:46:50,019 - distributed.worker - INFO - Threads: 1
2024-11-05 18:46:50,019 - distributed.worker - INFO - Memory: 121.68 GiB
2024-11-05 18:46:50,019 - distributed.worker - INFO - Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-crus8hmb
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO - Waiting to connect to: tcp://
2024-11-05 18:46:50,019 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 18:46:50,019 - distributed.worker - INFO - Threads: 1
2024-11-05 18:46:50,019 - distributed.worker - INFO - Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-vnamu75r
**And a lot more of these worker messages!**
2024-11-05 18:46:50,719 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 18:46:50,968 - distributed.scheduler - INFO - Receive client connection: Client-5fd67b4c-9be9-11ef-928c-1c34da62ac20
2024-11-05 18:46:50,968 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 18:46:50,970 - distributed.worker - INFO - Run out-of-band function 'stop'
2024-11-05 18:46:50,970 - distributed.scheduler - INFO - Scheduler closing due to unknown reason...
2024-11-05 18:46:50,970 - distributed.scheduler - INFO - Scheduler closing all comms
2024-11-05 18:46:50,972 - distributed.worker - INFO - Stopping worker at tcp:// Reason: scheduler-close
2024-11-05 18:46:50,972 - distributed.worker - INFO - Stopping worker at tcp:// Reason: scheduler-close
2024-11-05 18:46:50,972 - distributed.worker - INFO - Stopping worker at tcp:// Reason: scheduler-close
2024-11-05 18:46:50,972 - distributed.worker - INFO - Stopping worker at tcp:// Reason: scheduler-close
**And a lot more of these scheduler-close messages!**
2024-11-05 18:46:50,981 - distributed.core - INFO - Connection to tcp:// has been closed.
2024-11-05 18:46:50,981 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://', name: 12, status: running, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.9816923')
2024-11-05 18:46:50,982 - distributed.core - INFO - Connection to tcp:// has been closed.
2024-11-05 18:46:50,982 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://', name: 52, status: running, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.98224')
**And a lot more of these Remove worker messages!**
2024-11-05 18:46:50,978 - distributed.batched - INFO - Batched Comm Closed <TCP (closed) Worker->Scheduler local=tcp:// remote=tcp://>
Traceback (most recent call last):
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 297, in write
    raise StreamClosedError()
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 307, in write
    convert_stream_closed_error(self, e)
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Worker->Scheduler local=tcp:// remote=tcp://>: Stream is closed
2024-11-05 18:46:50,994 - distributed.core - INFO - Connection to tcp:// has been closed.
2024-11-05 18:46:50,994 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://', name: 28, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.9946446')
2024-11-05 18:46:50,994 - distributed.core - INFO - Connection to tcp:// has been closed.
2024-11-05 18:46:50,994 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://', name: 8, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.994934')
2024-11-05 18:46:50,995 - distributed.core - INFO - Connection to tcp:// has been closed.
2024-11-05 18:46:50,995 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://', name: 36, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.995262')
2024-11-05 18:46:50,995 - distributed.core - INFO - Connection to tcp:// has been closed.
2024-11-05 18:46:50,995 - distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://', name: 7, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1730861210.9955466')
**And a lot more of these Remove worker messages!**
2024-11-05 18:46:51,001 - distributed.scheduler - INFO - Lost all workers
2024-11-05 18:46:51,154 - distributed.client - ERROR -
ConnectionRefusedError: [Errno 111] Connection refused
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/core.py", line 342, in connect
    comm = await wait_for(
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/utils.py", line 1956, in wait_for
    return await fut
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 559, in connect
    convert_stream_closed_error(self, e)
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x15523055ab70>: ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/utils.py", line 832, in wrapper
    return await func(*args, **kwargs)
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/client.py", line 1330, in _reconnect
    await self._ensure_connected(timeout=timeout)
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/client.py", line 1360, in _ensure_connected
    comm = await connect(
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/site-packages/distributed/comm/core.py", line 366, in connect
    await asyncio.sleep(backoff)
  File "/home/sbhimineni/anaconda3/envs/ml/lib/python3.12/asyncio/tasks.py", line 665, in sleep
    return await future
2024-11-05 18:46:51,255 - distributed.core - INFO - Received 'close-stream' from tcp://; closing.
2024-11-05 18:46:51,256 - distributed.core - INFO - Received 'close-stream' from tcp://; closing.
2024-11-05 18:46:51,257 - distributed.core - INFO - Received 'close-stream' from tcp://; closing.
2024-11-05 18:46:51,262 - distributed.core - INFO - Received 'close-stream' from tcp://; closing.
**And a lot more of these messages!**
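For what it's worth, here is a quick sanity check on the memory figures in this first run. It is only arithmetic on numbers already shown above: the `--mem=249208M` request and the `Memory: 121.68 GiB` line each worker prints. It assumes that Slurm's `M` suffix means MiB; the helper name `mib_to_gib` is made up for illustration.

```python
def mib_to_gib(mib: float) -> float:
    """Convert MiB to GiB (assumes 1 GiB = 1024 MiB)."""
    return mib / 1024


slurm_mem_mib = 249208      # from "#SBATCH --mem=249208M" (assumed to be MiB)
worker_limit_gib = 121.68   # from "Memory: 121.68 GiB" in the worker log

# Total memory requested for the job, in GiB
job_gib = mib_to_gib(slurm_mem_mib)

# How many per-worker limits fit into the job's allocation
ratio = job_gib / worker_limit_gib

print(f"job allocation: {job_gib:.2f} GiB")
print(f"allocation / per-worker limit: {ratio:.2f}")
```

If that assumption holds, each worker reports a limit of about half the job's allocation, not the allocation divided by the number of MPI ranks. I don't know whether that is expected.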
I also tried a different job configuration, using only 4 cores (with 7788M of memory) on a compute node, and ran into the following problem:
The following have been reloaded with a version change:
1) cpu/0.17.3b => cpu/0.15.4
2024-11-05 14:11:04,547 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2024-11-05 14:11:04,594 - distributed.scheduler - INFO - State start
2024-11-05 14:11:04,597 - distributed.scheduler - INFO - Scheduler at: tcp://
2024-11-05 14:11:04,598 - distributed.scheduler - INFO - dashboard at:
2024-11-05 14:11:04,598 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2024-11-05 14:11:04,628 - distributed.scheduler - INFO - Receive client connection: Client-d97700b4-9bc2-11ef-b065-1c34da75846a
2024-11-05 14:11:04,631 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 14:11:04,962 - distributed.worker - INFO - Start worker at: tcp://
2024-11-05 14:11:04,962 - distributed.worker - INFO - Listening to: tcp://
2024-11-05 14:11:04,962 - distributed.worker - INFO - Worker name: 3
2024-11-05 14:11:04,962 - distributed.worker - INFO - dashboard at:
2024-11-05 14:11:04,962 - distributed.worker - INFO - Waiting to connect to: tcp://
2024-11-05 14:11:04,962 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,962 - distributed.worker - INFO - Threads: 1
2024-11-05 14:11:04,962 - distributed.worker - INFO - Memory: 243.34 MiB
2024-11-05 14:11:04,962 - distributed.worker - INFO - Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-89qs3ik2
2024-11-05 14:11:04,963 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,978 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://', name: 3, status: init, memory: 0, processing: 0>
2024-11-05 14:11:04,979 - distributed.scheduler - INFO - Starting worker compute stream, tcp://
2024-11-05 14:11:04,979 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 14:11:04,979 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-11-05 14:11:04,980 - distributed.worker - INFO - Registered to: tcp://
2024-11-05 14:11:04,980 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,980 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 14:11:04,982 - distributed.worker - INFO - Start worker at: tcp://
2024-11-05 14:11:04,982 - distributed.worker - INFO - Listening to: tcp://
2024-11-05 14:11:04,983 - distributed.worker - INFO - Worker name: 2
2024-11-05 14:11:04,983 - distributed.worker - INFO - dashboard at:
2024-11-05 14:11:04,983 - distributed.worker - INFO - Waiting to connect to: tcp://
2024-11-05 14:11:04,983 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,983 - distributed.worker - INFO - Threads: 1
2024-11-05 14:11:04,983 - distributed.worker - INFO - Memory: 243.34 MiB
2024-11-05 14:11:04,983 - distributed.worker - INFO - Local Directory: /expanse/lustre/scratch/sbhimineni/temp_project/TPOT_test/dask-scratch-space/worker-f0g5co33
2024-11-05 14:11:04,983 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,988 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://', name: 2, status: init, memory: 0, processing: 0>
2024-11-05 14:11:04,988 - distributed.scheduler - INFO - Starting worker compute stream, tcp://
2024-11-05 14:11:04,988 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 14:11:04,988 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-11-05 14:11:04,989 - distributed.worker - INFO - Registered to: tcp://
2024-11-05 14:11:04,989 - distributed.worker - INFO - -------------------------------------------------
2024-11-05 14:11:04,989 - distributed.core - INFO - Starting established connection to tcp://
2024-11-05 14:11:05,159 - distributed.worker.memory - WARNING - Worker is at 162% memory usage. Pausing worker. Process memory: 395.18 MiB -- Worker memory limit: 243.34 MiB
2024-11-05 14:11:05,159 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 395.18 MiB -- Worker memory limit: 243.34 MiB
2024-11-05 14:11:05,188 - distributed.worker.memory - WARNING - Worker is at 167% memory usage. Pausing worker. Process memory: 408.41 MiB -- Worker memory limit: 243.34 MiB
2024-11-05 14:11:05,265 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 415.61 MiB -- Worker memory limit: 243.34 MiB
2024-11-05 14:11:15,190 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 395.18 MiB -- Worker memory limit: 243.34 MiB
**The same message continued for a while before I killed the job!**
slurmstepd: error: *** STEP 34887673.0 ON exp-13-53 CANCELLED AT 2024-11-05T15:11:49 ***
slurmstepd: error: *** JOB 34887673 ON exp-13-53 CANCELLED AT 2024-11-05T15:11:49 ***
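To check that I am reading the warnings in this second run correctly, the sketch below recomputes the percentages from the values in the log (the 243.34 MiB worker limit and the two process-memory readings). It assumes the log truncates the percentage instead of rounding it, which is a guess on my part that happens to match the 167% line. It also compares the limit with the 7788M I requested, again assuming `M` means MiB.

```python
worker_limit_mib = 243.34            # "Worker memory limit: 243.34 MiB"
process_mib = [395.18, 408.41]       # "Process memory" values from the two warnings
requested_mib = 7788                 # memory requested for the 4-core job (assumed MiB)

# Percentage of the limit in use, truncated to an integer (assumption, see above)
percentages = [int(p / worker_limit_mib * 100) for p in process_mib]

# How many per-worker limits fit into the requested memory
ratio = requested_mib / worker_limit_mib

print(percentages)
print(f"requested / per-worker limit: {ratio:.1f}")
```

The recomputed percentages match the 162% and 167% in the warnings, so the workers are over their limit right after start-up, before my computation has done anything substantial. The per-worker limit also comes out at roughly 1/32 of the memory I requested, which I cannot explain for a 4-core job.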
I am not sure what is going wrong. Any help in resolving this matter would be appreciated.
Thank you.