Dask Local Distributed vs DataFrame

Hello guys!
I've been using Dask for a while now, since pandas stopped being an option after some of my dataframes grew past 120 million rows.

As expected, Dask's performance is not as good as plain old pandas. OK, I was just using dask[dataframe] as a drop-in replacement for pandas.

Some time ago I read about dask[distributed]. As far as I can tell, running my queries under the distributed scheduler seems much faster than the non-distributed dataframe.

Dask Distributed:

  • Running locally with 4 workers (a sketch of my setup follows this list)
  • Seems much faster
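
For reference, I start the local cluster with something like this (a minimal sketch; my actual settings may differ, since Client() with no arguments picks its own worker count):

```python
from dask.distributed import Client

# Client() with no arguments spins up a LocalCluster sized to the
# machine; on my box it creates 4 workers.
client = Client()
```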

So far everything is good!

But… with distributed I hit crashes where all the workers fail due to memory limits, whereas with the non-distributed scheduler the same code runs to completion (slowly, but it runs!).

  1. Is my perception correct that distributed is faster than non-distributed, even on just a local machine?
  2. Why would distributed crash when the default scheduler doesn't?

Thanks in advance!

Hi @frbelotto,

Dask offers several different scheduling backends. DataFrame uses the threaded scheduler by default, meaning that each task (a partition computation) runs inside a Python thread. Depending on your operations, this can mean no parallelization at all (because of the Python GIL).
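
To make the difference concrete, here is a hedged sketch of switching local schedulers on the same computation (the file path and column names are hypothetical):

```python
import dask.dataframe as dd

# Hypothetical input and columns, for illustration only.
df = dd.read_parquet("data.parquet")
query = df.groupby("key")["value"].mean()

# Default for DataFrame: the threaded scheduler. GIL-bound Python
# code may see little or no speedup here.
result = query.compute(scheduler="threads")

# Process-based local scheduler: sidesteps the GIL, at the cost of
# serializing data between processes.
result = query.compute(scheduler="processes")
```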

The distributed scheduler uses a separate process for each worker. This ensures parallelism no matter the code, but it also adds a bit of overhead due to serialization and data exchange between processes. It also means that each worker only has access to a portion of your total available memory, which can cause memory problems depending on how you've configured it, your algorithms, and the partitioning of your DataFrame.
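
If it helps, here is a minimal sketch of starting a local distributed cluster with an explicit per-worker memory budget (the worker count and memory limit below are assumptions; tune them to your machine):

```python
from dask.distributed import Client, LocalCluster

# Assumed settings for illustration: 4 single-threaded workers,
# each capped at 4 GB. Each worker then only sees its own slice
# of the machine's RAM, not the full total.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)

# A single large partition (or shuffle intermediates) that exceeds
# one worker's limit can kill that worker, even when the machine as
# a whole still has free memory.
```

This is often why distributed crashes while the default scheduler does not: the threaded scheduler runs in one process that can use the whole machine's memory.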

So we’ll need more details to provide further help.