I had set up Dask on Kubernetes (three nodes, 8 CPU threads and a 16 GB memory limit per node), but the performance is bad: it takes about an hour to count 52 million records.
What's going wrong, and how can I improve it?
import dask.dataframe as dd
df2 = dd.read_csv('/mnt/output*.csv')
%time df2.shape[0].compute()
[########################################] | 100% Completed | 1hr 6min 27.8s
CPU times: user 7min 55s, sys: 40.6 s, total: 8min 35s
Wall time: 1h 6min 27s
52069903
Apologies for the delay in responding. I think you may already have fixed this.
For anyone else who faces the same issue in the future:
- Dask executes lazily: read_csv only starts running when we call compute();
- In the above case, since we have a distributed setup, the data needs to be shipped to each worker (which takes time); that's why it's generally recommended in such cases to keep the data in remote storage (such as AWS S3), so workers can read it directly (see the sketch after this list);
- The total size of the data and the size of each partition also affect how long the read takes.
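Here is a minimal sketch of what the remote-storage suggestion could look like, assuming the files have been copied to a hypothetical S3 bucket (s3://my-bucket/) and that s3fs is installed; blocksize is the knob that controls partition size:

import dask.dataframe as dd

# Assumption: the CSVs live in a hypothetical S3 bucket and the s3fs
# package is installed so Dask can read object storage directly.
df = dd.read_csv(
    's3://my-bucket/output*.csv',   # each worker fetches its own byte ranges
    blocksize='128MB',              # partition size; larger blocks mean fewer, bigger partitions
)

# The row count still triggers a full read, but it now happens in parallel
# on the workers instead of streaming everything through a single mount.
print(df.shape[0].compute())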
There are some notes on how to tune these in the docs (dask.dataframe.read_csv — Dask documentation) and in this blog post: Reading CSV files into Dask DataFrames with read_csv (Coiled).
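Before digging into those docs, a quick way to check whether partition size is the issue is to look at how many partitions the read produces and roughly how large one of them is in memory. A small sketch, using the same local path from the question:

import dask.dataframe as dd

df2 = dd.read_csv('/mnt/output*.csv')

# How many partitions did the read produce?
print(df2.npartitions)

# Materialize a single partition to gauge its in-memory size in bytes;
# partitions of roughly 100 MB are usually a reasonable target.
first = df2.get_partition(0).compute()
print(first.memory_usage(deep=True).sum())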