Hello guys!
I've been using Dask for a while, since pandas stopped being an option once some of my dataframes grew to more than 120 million rows.
As expected, Dask is not as fast as plain pandas, which is fine. I was just using dask[dataframe] as a drop-in replacement for pandas.
Some time ago I read about dask[distributed]. As far as I can tell, my queries run much faster under distributed than with the non-distributed dataframe.
Dask Distributed:
- Running on local 4 workers
- Seems much faster
So far everything is good!
But… with distributed I get crashes where all the workers fail due to memory limits, whereas without distributed the same code runs to completion (slowly, but it still runs!).
- Is my perception correct that distributed is faster than non-distributed, even on just a local machine?
- Why would distributed crash?
Thanks in advance!