I want to use adaptive scaling for running jobs on HPC clusters, but it keeps crashing after a while. The exact same code with static scaling works perfectly. I have reduced my project to a minimal failing example:
import time

from dask_jobqueue import SLURMCluster
from distributed import Client, progress

cluster = SLURMCluster()
cluster.adapt(minimum_jobs=1, maximum_jobs=4)

def dummy_work(my_arg):
    print(my_arg)
    time.sleep(600)
    return my_arg * 2

args = range(10)
with Client(cluster) as client:
    fut = client.map(dummy_work, args)
    progress(fut, interval=10.0)
    res = client.gather(fut)
    print(res)

args = range(100, 110)
with Client(cluster) as client:
    fut = client.map(dummy_work, args)
    progress(fut, interval=10.0)
    res = client.gather(fut)
    print(res)

args = range(200, 230)
with Client(cluster) as client:
    fut = client.map(dummy_work, args)
    progress(fut, interval=10.0)
    res = client.gather(fut)
    print(res)

print("SUCCESS")
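For reference, the same adapt()/map()/gather() call pattern can be exercised without a SLURM installation by swapping in distributed's LocalCluster (the LocalCluster arguments and the smaller, sleep-free workload below are my choices, not part of the failing setup; note that the generic Cluster.adapt() takes minimum/maximum workers rather than the minimum_jobs/maximum_jobs keywords that dask_jobqueue adds):

```python
# Same adapt() -> map() -> gather() pattern as above, but on a
# LocalCluster so it runs on a single machine without SLURM.
from distributed import Client, LocalCluster

def dummy_work(my_arg):
    return my_arg * 2

# Start with no workers and let the adaptive policy scale up on demand.
cluster = LocalCluster(n_workers=0, processes=False)
cluster.adapt(minimum=1, maximum=2)

with Client(cluster) as client:
    fut = client.map(dummy_work, range(10))
    res = client.gather(fut)

print(res)
cluster.close()
```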
When running this code, it initially works fine, but upon completion of the last client.map call, I get this error:
Traceback (most recent call last):
  File "dummy_dask.py", line 31, in <module>
    res = client.gather(fut)
  File "/home/raphael/env/lib/python3.7/site-packages/distributed/client.py", line 2152, in gather
    asynchronous=asynchronous,
  File "/home/raphael/env/lib/python3.7/site-packages/distributed/utils.py", line 310, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/raphael/env/lib/python3.7/site-packages/distributed/utils.py", line 376, in sync
    raise exc.with_traceback(tb)
  File "/home/raphael/env/lib/python3.7/site-packages/distributed/utils.py", line 349, in f
    result = yield future
  File "/home/raphael/env/lib/python3.7/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/raphael/env/lib/python3.7/site-packages/distributed/client.py", line 2009, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: ('dummy_work-bf3f0e9bfdc71bfb52bb831ade5f40fd', <WorkerState 'tcp://10.0.11.21:46643', name: SLURMCluster-2-5, status: closed, memory: 0, processing: 2>)
distributed.utils - ERROR -
Traceback (most recent call last):
  File "/home/raphael/env/lib/python3.7/site-packages/distributed/utils.py", line 693, in log_errors
    yield
  File "/home/raphael/env/lib/python3.7/site-packages/distributed/scheduler.py", line 7116, in feed
    await asyncio.sleep(interval)
  File "/home/raphael/env/lib/python3.7/asyncio/tasks.py", line 595, in sleep
    return await future
concurrent.futures._base.CancelledError
I have tried many variations to get this to work, but it always ends with this error. Does anyone have experience with this?
dask_jobqueue==0.7.2
dask==2022.2.0
distributed==2022.2.0