Optimizing Dask Delayed Pandas DataFrames for Large-Scale Data Processing - Emmanuel Katto

Hi all, I am Emmanuel Katto from Uganda. I'm working with a large dataset (~10 GB) that I need to process using Dask. I've created a Dask DataFrame from my data using pd.read_csv and dask.dataframe.read_csv. However, I'm experiencing slow performance on operations like filtering, grouping, and sorting.

To optimize my code, I’ve tried using various Dask techniques such as:

  1. Using dask.bag instead of dask.dataframe for processing large datasets
  2. Partitioning my data using dask.dataframe.map_partitions
  3. Using dask.dataframe.compute to compute the results in chunks
  4. Reorganizing my data to reduce the number of rows and columns

Despite these efforts, my code is still running slowly. Are there any other techniques or best practices I can use to optimize my code for large-scale data processing with Dask?

Emmanuel Katto

Hi @emmanuelkatto, welcome to Dask Discourse forum!

Dask should be perfect for processing such a dataset. It is not too big, so it should be fairly straightforward, but it really depends on your workflow and operations. Would you be able to produce a minimal reproducible example of what you are trying to do? At least a code snippet?

A few things reading your post:

  • You should use dask.dataframe.read_csv, not Pandas.
  • Grouping and sorting can be expensive when performed in a distributed way.
  • You shouldn’t have to use dask.bag here.
  • dask.dataframe.map_partitions does not partition data, it runs your code on every partition of your dask dataframe.
  • dask.dataframe.compute will execute your computations and return a final in-memory (pandas) object; it does not process results in chunks.

But again, without some reproducer, it’s hard to help further.