From my experience, I’ve found the Dask Xgboost classifier to be much slower than running regular Xgboost with the same total number of cores on a single bigger machine. For example, a large machine with 96 cores running regular Xgboost would be 2-4 times faster than a Dask cluster of 4 workers with 24 cores each running Dask Xgboost.
Is this supposed to be the case? Does anyone know how I can improve the speed on Dask Xgboost?
Distributed processing always comes with a cost: serialization, data exchange between processes or over the network, synchronization… So it is not surprising at all to observe that code runs more efficiently on a single machine, without distributed computing, than spread across several processes or servers — especially since Xgboost is already optimized and parallelized when running on a single machine.
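To make that cost concrete, here is a small self-contained sketch (plain Python `multiprocessing` standing in for Dask, and a deliberately cheap task) showing that fanning work out to several processes is not automatically faster once serialization and inter-process transfer are paid for:

```python
import time
from multiprocessing import Pool


def partial_sum(chunk):
    """Worker task: compute a partial result on one chunk of the data."""
    return sum(chunk)


def compare(n=1_000_000, workers=4):
    data = list(range(n))

    # In-process: the data stays in memory, no serialization or IPC.
    t0 = time.perf_counter()
    local_total = sum(data)
    local_time = time.perf_counter() - t0

    # Multi-process: each chunk is pickled, shipped to a worker process,
    # and the partial result shipped back -- for a task this cheap that
    # overhead often dominates, despite using more cores.
    chunks = [data[i::workers] for i in range(workers)]
    t0 = time.perf_counter()
    with Pool(workers) as pool:
        dist_total = sum(pool.map(partial_sum, chunks))
    dist_time = time.perf_counter() - t0

    print(f"in-process: {local_time:.4f}s  {workers} processes: {dist_time:.4f}s")
    return local_total, dist_total


if __name__ == "__main__":
    compare()
```

Both paths compute the same answer; the point is only the relative cost of moving the data between processes. Dask adds scheduler coordination and (on a real cluster) network transfer on top of this.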
Anyway, it would be interesting to understand where the efficiency loss comes from when using Dask. To do this, it would be necessary to have a Minimal Reproducible Example of some Xgboost code on a smaller case — something that could run on a standard laptop.
Yes, as Guillaume says above, memory bandwidth is faster than network bandwidth. If you can fit your problem on one large machine then you should certainly do so. You should only use Dask when you need to switch to multiple machines. You should avoid this move as long as possible.