Dask Local Distributed vs DataFrame

Hello guys!
I've been using Dask for a while now, since pandas stopped being an option after some of my dataframes grew past 120 million rows.

As expected, Dask's performance is not as good as plain old pandas. OK, I was just using dask[dataframe] as a drop-in replacement for pandas.

Some time ago I read about dask[distributed]. As far as I can tell, running my queries under the distributed scheduler seems much faster than the non-distributed dataframe.

Dask Distributed:

  • Running locally with 4 workers (a sketch of my setup follows this list)
  • Seems much faster
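
For reference, I start the local cluster with something like this (a minimal sketch; my actual settings may differ, since Client() with no arguments picks its own worker count):

```python
from dask.distributed import Client

# Client() with no arguments spins up a LocalCluster sized to the
# machine; on my box it creates 4 workers.
client = Client()
```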

So far everything is good!

But… with distributed I hit crashes where all the workers fail due to memory limits, whereas with the non-distributed scheduler the same code runs to completion (slowly, but it runs!).

  1. Is my perception correct that distributed is faster than non-distributed, even on just a local machine?
  2. Why would distributed crash when the default scheduler doesn't?

Thanks in advance!

Hi @frbelotto,

Dask offers several different scheduling backends. DataFrame uses the threaded scheduler by default, meaning that each task (a partition computation) runs inside a Python thread. Depending on your operations, this can mean no parallelization at all (because of the Python GIL).
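
To make the difference concrete, here is a hedged sketch of switching local schedulers on the same computation (the file path and column names are hypothetical):

```python
import dask.dataframe as dd

# Hypothetical input and columns, for illustration only.
df = dd.read_parquet("data.parquet")
query = df.groupby("key")["value"].mean()

# Default for DataFrame: the threaded scheduler. GIL-bound Python
# code may see little or no speedup here.
result = query.compute(scheduler="threads")

# Process-based local scheduler: sidesteps the GIL, at the cost of
# serializing data between processes.
result = query.compute(scheduler="processes")
```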

The distributed scheduler uses a separate process for each worker. This ensures parallelism no matter the code, but it also adds a bit of overhead due to serialization and data exchange between processes. It also means that each worker only has access to a portion of your total available memory, which can cause memory problems depending on how you've configured it, your algorithms, and the partitioning of your DataFrame.
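
If it helps, here is a minimal sketch of starting a local distributed cluster with an explicit per-worker memory budget (the worker count and memory limit below are assumptions; tune them to your machine):

```python
from dask.distributed import Client, LocalCluster

# Assumed settings for illustration: 4 single-threaded workers,
# each capped at 4 GB. Each worker then only sees its own slice
# of the machine's RAM, not the full total.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)

# A single large partition (or shuffle intermediates) that exceeds
# one worker's limit can kill that worker, even when the machine as
# a whole still has free memory.
```

This is often why distributed crashes while the default scheduler does not: the threaded scheduler runs in one process that can use the whole machine's memory.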

So we’ll need more details to provide further help.