Dear group members,
Previously, I asked a question here and the suggested solution worked well.
Now I'm working on a much larger dataset. I use 50 SLURM jobs, each with a 5-hour walltime.
cluster = SLURMCluster(cores=1, processes=1, memory="20GB", walltime="05:00:00")
cluster.scale(50)
So my worker pool has size 50, and there is no multi-threading (one thread per worker). I submit tasks and gather them in my code. The output of each function is written to a pickle file, which is then read by another function according to the dependency graph, so not much data is passed between nodes through Dask itself.
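For completeness, main.py attaches a client to this cluster and only gathers small integer results; roughly like this (a simplified sketch, the variable names are placeholders):

from distributed import Client

client = Client(cluster)          # the scheduler runs inside the main.py SLURM job
futures = []                      # filled by client.submit calls (see below)
# ... submit the tasks for each input ...
results = client.gather(futures)  # each result is a small integer; the real data
                                  # is exchanged via pickle files on the shared filesystem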
When I have 20 inputs and, for each of them, I call client.submit for 300 tasks (which depend on each other inside my function infer_hogs_for_rhog_levels_recursively_future), I end up with about 6,000 tasks in total. Because of the dependencies, some of them are not computed at first, but gradually all of them finish. This works well, in around one hour, thanks to Dask.
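Per input, the submission pattern is roughly the sketch below (node and child are placeholder names; the real recursion happens inside infer_hogs_for_rhog_levels_recursively_future). Child futures are passed to the parent task, so Dask only schedules a parent once its children have finished:

# simplified sketch of how one input turns into hundreds of dependent tasks
child_futures = [client.submit(infer_hogs_for_rhog_levels_recursively_future, child)
                 for child in node.children]
parent_future = client.submit(infer_hogs_for_rhog_levels_recursively_future,
                              node, child_futures)
futures.append(parent_future)     # gathered later in main.py; each task returns an integer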
But when I have 1,000 inputs, each needing 300 client.submit calls, Dask starts working, and after 30 minutes to an hour many of the tasks end up erred.
(My final goal is to run the code for 50,000 inputs, each with 900 tasks.)
The SLURM output of the node on which main.py is running:
2022-10-28 14:51:24 INFO python code - dask started
2022-10-28 15:17:31,147 - distributed.scheduler - ERROR - Couldn't gather keys {'infer_hogs_for_rhog_levels_recursively_future-c7e6b31e27dcba1e87e43d898aba9c49': ['tcp://10.203.101.102:35255']} state: ['memory'] workers: ['tcp://10.203.101.102:35255']
NoneType: None
2022-10-28 15:17:31,912 - distributed.scheduler - ERROR - Workers don't have promised key: ['tcp://10.203.101.102:35255'], infer_hogs_for_rhog_levels_recursively_future-c7e6b31e27dcba1e87e43d898aba9c49
NoneType: None
2022-10-28 15:18:39,674 - distributed.scheduler - ERROR - Couldn't gather keys {'infer_hogs_for_rhog_levels_recursively_future-94108266e29785ca7cff8a9b2f221054': []} state: ['processing'] workers: []
NoneType: None
2022-10-28 15:18:39,674 - distributed.scheduler - ERROR - Workers don't have promised key: [], infer_hogs_for_rhog_levels_recursively_future-94108266e29785ca7cff8a9b2f221054
NoneType: None
2022-10-28 15:18:39,674 - distributed.scheduler - ERROR - Couldn't gather keys {'infer_hogs_for_rhog_levels_recursively_future-122e1af610164acb21a2749a4011c89f': []} state: ['processing'] workers: []
NoneType: None
2022-10-28 15:18:39,674 - distributed.scheduler - ERROR - Workers don't have promised key: [], infer_hogs_for_rhog_levels_recursively_future-122e1af610164acb21a2749a4011c89f
NoneType: None
The SLURM output of one of the computation nodes:
2022-10-28 14:51:20,449 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.203.101.102:43799'
2022-10-28 14:51:21,369 - distributed.diskutils - INFO - Found stale lock file and directory '/work/FAC/FBM/DBC/cdessim2/default/smajidi1/fastget/bird_hog/gethog3_27oct/dask-worker-space/worker-ccsaoycd', purging
2022-10-28 14:51:24,536 - distributed.worker - INFO - Start worker at: tcp://10.203.101.102:35255
2022-10-28 14:51:24,536 - distributed.worker - INFO - Listening to: tcp://10.203.101.102:35255
2022-10-28 14:51:24,536 - distributed.worker - INFO - dashboard at: 10.203.101.102:35613
2022-10-28 14:51:24,536 - distributed.worker - INFO - Waiting to connect to: tcp://10.203.101.150:35647
2022-10-28 14:51:24,536 - distributed.worker - INFO - -------------------------------------------------
2022-10-28 14:51:24,536 - distributed.worker - INFO - Threads: 1
2022-10-28 14:51:24,536 - distributed.worker - INFO - Memory: 83.82 GiB
2022-10-28 14:51:24,537 - distributed.worker - INFO - Local Directory: /work/FAC/FBM/DBC/cdessim2/default/smajidi1/fastget/bird_hog/gethog3_27oct/dask-worker-space/worker-_eqjkonw
2022-10-28 14:51:24,537 - distributed.worker - INFO - -------------------------------------------------
2022-10-28 14:51:24,552 - distributed.worker - INFO - Registered to: tcp://10.203.101.150:35647
2022-10-28 14:51:24,552 - distributed.worker - INFO - -------------------------------------------------
2022-10-28 14:51:24,553 - distributed.core - INFO - Starting established connection
2022-10-28 14:51:29,321 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.46s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
/work/miniconda3/lib/python3.8/subprocess.py:844: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
self.stdout = io.open(c2pread, 'rb', bufsize)
/work/miniconda3/lib/python3.8/subprocess.py:849: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
self.stderr = io.open(errread, 'rb', bufsize)
2022-10-28 15:08:25,493 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.00s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2022-10-28 15:11:00,914 - distributed.core - INFO - Event loop was unresponsive in Worker for 5.70s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2022-10-28 15:13:31,653 - distributed.core - INFO - Event loop was unresponsive in Worker for 4.01s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
The output of each task is an integer, which is of course not large data; however, I do open and close files of a few MB during the run.
I guess I can change some config from this to make it more robust, for example something along the lines of the sketch below (the keys are real distributed settings, but the values are only my guesses):
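import dask

# Assumption on my side: relaxing these distributed settings before creating the
# SLURMCluster/Client might help with the "Couldn't gather keys" errors;
# the values are guesses, not recommendations I have tested.
dask.config.set({
    "distributed.comm.timeouts.connect": "60s",    # allow slower worker connections
    "distributed.comm.timeouts.tcp": "60s",        # tolerate an unresponsive event loop
    "distributed.comm.retry.count": 5,             # retry failed communications
    "distributed.scheduler.allowed-failures": 10,  # re-run tasks from lost workers more often
})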
I would appreciate it if you could help with this.
Regards,
Sina