Hi
Thank you @guillaumeeb for your help, and very sorry for the late reply. I had to re-run some tests.
- The first issue was avoided by using the shared folder on the Slurm cluster, as you suggested. My original question was about adding Python files in the parent directory of my current working directory, but that no longer matters, since using the shared folder resolves it.
- A snippet of the `LocalCluster` code:

```python
from dask.distributed import Client, performance_report
import time

client = Client(n_workers=nworkers, threads_per_worker=nthreads_per_worker)
with performance_report(filename="dask_report_map_" + anal_timing_id + "_" + str(time.time()) + ".html"):
    tmp_paw_computations = client.map(PlantAvailableWater_numpy_dask,
                                      poro_files[:max_num_requests],
                                      satur_files[:max_num_requests],
                                      [set_key_dims for i in range(max_num_requests)])
    final_paw_results = client.gather(tmp_paw_computations)
```
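(Just to be explicit about what I mean by `LocalCluster` here: the `Client(n_workers=..., threads_per_worker=...)` call above creates a `LocalCluster` implicitly. The explicit equivalent would be roughly the sketch below, with `nworkers` and `nthreads_per_worker` defined earlier in my script.)

```python
# Explicit equivalent of Client(n_workers=..., threads_per_worker=...) above;
# a rough sketch, not the exact code I run.
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=nworkers, threads_per_worker=nthreads_per_worker)
client = Client(cluster)
```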
while the code snippet for the `SLURMCluster` looks as follows (some options in `SLURMCluster()` are deprecated, as I'm restricted to the module versions available at my institution):
```python
import os
import time
from dask.distributed import Client, performance_report
from dask_jobqueue import SLURMCluster

num_processes_per_job = 10  # Setting the number of processes per job to int(os.cpu_count()/2) leads to some workers exceeding 95% of memory, so it may be better to reduce the number of processes per job
num_cores_per_job = num_processes_per_job  # Number of threads per job
num_gpus_per_job = 4
num_jobs = 2
# SOME OTHER STEPS
if __name__ == '__main__':
    # SOME OTHER STEPS
    cluster = SLURMCluster(
        cores=num_cores_per_job,
        processes=num_processes_per_job,
        memory="12GB",
        shebang="#!/usr/bin/env bash",
        queue=queue,
        scheduler_options={"dashboard_address": ":" + str(port)},
        walltime="01:00:00",
        local_directory=os.environ["SCRATCH_deepacf"],
        death_timeout="60m",
        log_directory=f'{os.environ["HOME"]}/dask_jobqueue_logs',
        interface="ib0",
        project=project,
        # Loading a special Python environment before calculations.
        # env_extra is deprecated and should be replaced by job_script_prologue in newer versions.
        env_extra=['source $HOME/jupyter/kernels/dg_rr_analytics/bin/activate'],
        python="$HOME/jupyter/kernels/dg_rr_analytics/bin/python",
        job_extra=['--gres gpu:' + str(num_gpus_per_job)],
    )
    cluster.scale(jobs=num_jobs)  # Without this step, no job is launched
    client = Client(cluster)
    # SOME OTHER STEPS
    # with performance_report(filename="dask_jobqueue_report_map_" + anal_timing_id + "_" + str(time.time()) + ".html"):
    tmp_paw_computations = client.map(PlantAvailableWater_numba_cuda_dask_with_data_preparation,
                                      poro_files[0:max_num_requests],
                                      satur_files[0:max_num_requests],
                                      [set_key_dims for i in range(max_num_requests)],
                                      [os.path.join(libraries_path, "tmp.pfb") for i in range(max_num_requests)],
                                      [Alpha for i in range(max_num_requests)],
                                      [Nvg for i in range(max_num_requests)],
                                      [Sres for i in range(max_num_requests)],
                                      [nrw_paw_params for i in range(max_num_requests)],
                                      [(10, 10) for i in range(max_num_requests)],
                                      [() for i in range(max_num_requests)],
                                      [paw_res_path_dask for i in range(max_num_requests)])
    final_paw_results = client.gather(tmp_paw_computations)
```
I face this memory problem when trying to get the same number of workers (40) from a single job on the `SLURMCluster`, whereas launching 40 workers succeeds with the distributed cluster on my local machine.
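To make the numbers concrete, here is my rough understanding of how dask-jobqueue splits a job's memory among its worker processes (the values are just the ones from the snippet above; please correct me if this arithmetic is wrong):

```python
# Rough sizing arithmetic (my assumption about how dask-jobqueue divides resources per job):
memory_per_job_gb = 12                # memory="12GB" in SLURMCluster()
processes_per_job = 10                # processes=num_processes_per_job
mem_per_worker_gb = memory_per_job_gb / processes_per_job   # ~1.2 GB per worker

# Packing 40 workers into a single job with the same memory would leave ~0.3 GB each,
# so scaling out more jobs instead seems like the alternative:
workers_wanted = 40
jobs_needed = workers_wanted // processes_per_job            # 4 jobs of 10 workers each
# cluster.scale(jobs=jobs_needed)

# print(cluster.job_script())  # to double-check how memory/cores end up in the Slurm script
```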
- In the same `SLURMCluster` snippet above, if I uncomment the `with performance_report` statement (and, of course, indent the following steps properly), I receive the following error:
```
distributed.core - ERROR - Exception while handling op performance_report
Traceback (most recent call last):
  File "/p/software/juwels/stages/2022/software/dask/2021.9.1-gcccoremkl-11.2.0-2021.4.0/lib/python3.9/site-packages/distributed/core.py", line 502, in handle_comm
    result = await result
  File "/p/software/juwels/stages/2022/software/dask/2021.9.1-gcccoremkl-11.2.0-2021.4.0/lib/python3.9/site-packages/distributed/scheduler.py", line 7356, in performance_report
    task_stream = self.get_task_stream(start=start)
  File "/p/software/juwels/stages/2022/software/dask/2021.9.1-gcccoremkl-11.2.0-2021.4.0/lib/python3.9/site-packages/distributed/scheduler.py", line 7030, in get_task_stream
    return plugin.collect(start=start, stop=stop, count=count)
  File "/p/software/juwels/stages/2022/software/dask/2021.9.1-gcccoremkl-11.2.0-2021.4.0/lib/python3.9/site-packages/distributed/diagnostics/task_stream.py", line 59, in collect
    start = bisect(start, 0, len(self.buffer))
  File "/p/software/juwels/stages/2022/software/dask/2021.9.1-gcccoremkl-11.2.0-2021.4.0/lib/python3.9/site-packages/distributed/diagnostics/task_stream.py", line 48, in bisect
    startstop["stop"] for startstop in self.buffer[mid]["startstops"]
KeyError: 'startstops'
'startstops'
  File "/p/home/jusers/elshambakey1/juwels/mini_dg_rr/scripts/paw_benchmarking_dask_jobqueue_map.py", line 188, in <module>
    traceback.print_stack()
```
So, I'm using other approaches, such as the Python logging module (as suggested in the documentation) and recording my own JSON files. But I wonder whether I'm doing something wrong, or whether there is some way to enable `performance_report` with the `SLURMCluster`?
Regards