Bad performance with dask in k8s?

RexHuang · April 14, 2022, 12:34pm

I had setup dask in k8s (with three nodes, 8threads cpu, 16GB memory limits per node), but the formance is bad, about 1 hour to count 52million records.

what’s go wrong, how can I improve?

import dask.dataframe as dd
df2=dd.read_csv(‘/mnt/output*.csv’)
%time df2.shape[0].compute()
[########################################] | 100% Completed | 1hr 6min 27.8s
CPU times: user 7min 55s, sys: 40.6 s, total: 8min 35s
Wall time: 1h 6min 27s
52069903

pavithraes · May 12, 2022, 6:20pm

Apologies for the delay in response. I think you may already have fixed this.

For anyone else who may face the same issue in the future:

Dask starts executing from read_csv when we call compute();
In the above case, since we have a distributed setup, the data needs to be sent to each worker (which takes time) – that’s why it’s generally recommended to work with data stored remotely (on something like AWS S3) in such cases;
The total size of data and the size of each partition will impact the time it takes to read the data.

There are some notes on how to optimize these in the docs here: dask.dataframe.read_csv — Dask documentation, and this blog post: Reading CSV files into Dask DataFrames with read_csv - Coiled : Coiled

Topic		Replies	Views
Why does dask take long time to compute regardless of the size of dataframe and partitions Dask DataFrame	2	2407	April 1, 2022
Seeking Feedback on Dask Implementation for Custom Function Application Dask DataFrame delayed	4	36	January 10, 2025
Need help with efficient parallelization [local machine] Distributed delayed , distributed	2	252	July 30, 2022
Dask.read_sql_table() too slow Distributed	7	348	June 16, 2023
Optimizing Dask Delayed Pandas DataFrames for Large-Scale Data Processing - Emmanuel Katto Dask DataFrame delayed	3	83	September 19, 2024

Bad performance with dask in k8s?

Related topics