How should I interpret Dask Bag's `npartitions` keyword?

I’m running something like the code below on a cluster, and the work seems to be spread unevenly across processes:

import dask.bag as db

n_worker = 20
# load('all_items.pkl') returns the full sequence of items to process
b = db.from_sequence(load('all_items.pkl'), npartitions=20)

Toward the end of the computation, some of the nodes start idling. How should I set `npartitions` for this task?

Hi @b-y-f,


Setting `npartitions=20` means that you split your input sequence into 20 partitions, so Dask will have 20 tasks to compute on its workers. With 20 partitions and 20 workers, each worker processes exactly one partition. If some partitions finish faster than others, the corresponding workers are left with no more work to perform.
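As a quick illustrative sketch (toy data and the local scheduler rather than your cluster), you can see how `from_sequence` splits a sequence into partitions:

```python
import dask.bag as db

# Toy example: 100 items split into 20 partitions of 5 items each.
b = db.from_sequence(range(100), npartitions=20)
print(b.npartitions)  # 20

# Wrap len() in a list so each partition yields a single count.
sizes = b.map_partitions(lambda part: [len(part)]).compute()
print(sizes)  # [5, 5, 5, ..., 5] — one count per partition
```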

In distributed processing in general, and especially when the time needed to process each partition varies, it is good practice to have at least twice as many partitions as workers. You can go even higher if processing each partition takes a long time, so that workers finishing early keep picking up remaining partitions instead of idling.
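For example, a minimal sketch with made-up numbers (4 workers, the local threaded scheduler, and a toy squaring task instead of your real workload); the same idea applies to your `n_worker = 20` cluster:

```python
import dask.bag as db

n_workers = 4
# Use ~3x as many partitions as workers so that a worker finishing
# its partition early can pick up another one instead of idling.
b = db.from_sequence(range(1000), npartitions=3 * n_workers)

result = (
    b.map(lambda x: x * x)
     .sum()
     .compute(scheduler="threads", num_workers=n_workers)
)
print(result)
```

With 12 partitions and 4 workers, each worker gets roughly 3 partitions, and slow partitions no longer leave the other workers idle for long.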