I am a beginner with Dask. I am trying to test Dask for data computing on a supercomputer with an LSF scheduler.
Each node has 40 cores and about 160 GB of available memory.
What I have tried:
from dask_jobqueue import LSFCluster
from dask.distributed import Client

cluster = LSFCluster(queue='short',
                     cores=40,
                     interface='ib0',
                     memory='300 GB',
                     processes=80,
                     ncpus=80,
                     job_directives_skip=['span[hosts=1]', '-W', '-M'],
                     shebang="#!/bin/bash",
                     job_extra_directives=['-R "span[ptile=40]"'],
                     )
cluster.scale(1)
client = Client(cluster)
import time
from dask.distributed import progress
def slow_increment(x):
time.sleep(1)
return x+1
futures = client.map(slow_increment, range(5000))
progress(futures)
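Would a per-node configuration like the sketch below be more appropriate? This is only my own guess from reading the dask-jobqueue docs; the processes and memory values here are my assumptions, not something from the administrator's example.

from dask_jobqueue import LSFCluster
from dask.distributed import Client

# One LSF job per node: cores/memory describe a single job here,
# and scale(jobs=2) asks for two such jobs, i.e. (hopefully) two nodes.
cluster = LSFCluster(queue='short',
                     cores=40,          # cores requested per job (per node)
                     memory='160 GB',   # memory requested per job (per node)
                     processes=8,       # 8 worker processes per job, 5 threads each
                     interface='ib0',
                     )
cluster.scale(jobs=2)  # two jobs -> two nodes
client = Client(cluster)

I left out job_directives_skip here, so the default -R "span[hosts=1]" directive that I am skipping above would stay in and keep each job on a single node.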
The job script (the header is the same as what the administrator recommends):
#!/bin/bash
#BSUB -J dask-worker
#BSUB -q short
#BSUB -n 80
#BSUB -R "span[ptile=40]"
#BSUB -o %J.out
#BSUB -e %J.err
/work/myusername/softwares/python/anaconda3/2022.10/envs/test_multinode/bin/python -m distributed.cli.dask_worker tcp://10.2.x.xx:41754 --nthreads 0 --nworkers 80 --memory-limit 3.4GiB --name dummy-name --nanny --death-timeout 60 --interface ib0
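If it matters, I believe the same script can also be printed directly from Python, with something like:

# Show the job script that dask-jobqueue will submit via bsub
print(cluster.job_script())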
I hope this job will use two nodes, each with 40 cores and 150 GB of memory.
However, when I check each node with the top command, one of the nodes has many tasks while the other has none.
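I think worker placement can also be checked from the client side with something like the snippet below (assuming scheduler_info() reports each worker's host, which is how I read the docs):

# Print the address, host, and thread count of every connected worker
info = client.scheduler_info()
for addr, w in info['workers'].items():
    print(addr, w['host'], w['nthreads'])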
So my questions are:
- Am I setting this up correctly? The reason I set the LSFCluster parameters this way is to make the generated job script exactly the same as the example given by the administrator.
- Why is the second node not doing any work?
- Am I understanding this correctly: in the documentation, does the term “job” refer to each task that we submit to the cluster, with no data exchange possible between these tasks?
Thank you in advance!