Optimizing Dask Delayed Pandas DataFrames for Large-Scale Data Processing - Emmanuel Katto

Hi All, I am Emmanuel Katto from Uganda. I’m working with a large dataset (~10 GB) that I need to process using Dask. I’ve created a Dask DataFrame from my data using pd.read_csv and dask.dataframe.read_csv. However, I’m seeing slow performance on operations like filtering, grouping, and sorting.

To optimize my code, I’ve tried using various Dask techniques such as:

  1. Using dask.bag instead of dask.dataframe for processing large datasets
  2. Partitioning my data using dask.dataframe.map_partitions
  3. Using dask.dataframe.compute to compute the results in chunks
  4. Reorganizing my data to reduce the number of rows and columns
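
For context, here is a simplified sketch of the shape of my pipeline (the path and column names below are placeholders, not my actual schema):

```python
import dask.dataframe as dd

# Placeholder path and columns; the real data is ~10 GB spread across many CSV files.
ddf = dd.read_csv("data/*.csv")

# Filtering
filtered = ddf[ddf["value"] > 0]

# Grouping
grouped = filtered.groupby("category")["value"].mean()

# Sorting
ordered = filtered.sort_values("value")

# Materializing the results is where things get slow for me
result = grouped.compute()
top_rows = ordered.head()
```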

Despite these efforts, my code is still running slowly. Are there any other techniques or best practices I can use to optimize my code for large-scale data processing with Dask?

Thanks!
Emmanuel Katto

Hi @emmanuelkatto, welcome to Dask Discourse forum!

Dask should be perfect for processing such a dataset: at ~10 GB it is not too big, so this should be fairly straightforward. But it really depends on your workflow and operations. Would you be able to produce a minimal reproducible example of what you are trying to do, or at least a code snippet?

A few things stand out from your post:

  • You should use dask.dataframe.read_csv, not Pandas, to build the DataFrame (see the sketch after this list).
  • Grouping and sorting can be expensive when performed in a distributed way.
  • You shouldn’t have to use dask.bag here.
  • dask.dataframe.map_partitions does not partition your data; it runs your code on every existing partition of your Dask DataFrame.
  • dask.dataframe.compute executes the lazy computations and returns a final, in-memory object.
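
To make those last points concrete, here is a minimal sketch, assuming a placeholder file path and column names:

```python
import dask.dataframe as dd
import pandas as pd

# Read lazily with Dask itself, not with pd.read_csv (placeholder path).
ddf = dd.read_csv("data/*.csv", blocksize="128MB")

# map_partitions applies a function to each existing pandas partition;
# it does not change how the data is partitioned.
def clean(partition: pd.DataFrame) -> pd.DataFrame:
    return partition.dropna(subset=["value"])

cleaned = ddf.map_partitions(clean)

# Everything above is lazy: nothing has been read or computed yet.
grouped = cleaned[cleaned["value"] > 0].groupby("category")["value"].mean()

# compute() runs the whole graph and returns a concrete, in-memory result.
result = grouped.compute()
```

If what you actually want is to change the partitioning, look at ddf.repartition(npartitions=...) or at setting a sorted index with set_index, rather than map_partitions.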

But again, without a reproducer, it’s hard to help further.

I have a similar issue. Is there any way to do this without it breaking? I do not mind Dask slowing down, but it needs to get the job done.

Hi @S1234, could you open another post, give some context, and especially provide a reproducible example?