Why does a Dask DataFrame take a long time to compute regardless of the size of the dataframe and the number of partitions? What is the reason behind it, and how can I avoid it?
Problem -
I’m currently working on an AWS SageMaker ml.c5.2xlarge instance, and the data is in an S3 bucket. I did not connect to a client because I was not able to: when I ran a client through a local cluster I got this error → AttributeError: 'MaterializedLayer' object has no attribute 'pack_annotations'
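Roughly what I tried before giving up on the client (approximate, written from memory; the exact arguments may have differed):

from dask.distributed import Client, LocalCluster

cluster = LocalCluster()   # local cluster on the SageMaker instance
client = Client(cluster)   # using this client is where the 'pack_annotations' AttributeError appeared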
So I proceeded without connecting to anything specific, and it is now running on the defaults (Cluster, Workers: 4, Cores: 8, Memory: 16.22 GB).
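For context, the dataframe is created roughly like this (the bucket name, file pattern, and the read_csv call are placeholders rather than my exact code):

import dask.dataframe as dd

# placeholder path: the real data lives in an S3 bucket
df = dd.read_csv("s3://my-bucket/path/to/data-*.csv")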
shape = df.shape            # (delayed row count, number of columns); nothing is computed yet
nrows = shape[0].compute()  # this triggers the actual computation over all partitions
print("nrows", nrows)
print(df.npartitions)
I tried computing the row count for the following cases, all on this setup:
- 24,700,000 rows (~24.7M), 23 partitions: CPU times: user 4min 48s, sys: 12.9 s, total: 5min 1s; Wall time: 4min 46s
- 5,120,000 rows (~5.1M), 23 partitions: CPU times: user 4min 50s, sys: 12 s, total: 5min 2s; Wall time: 4min 46s
- 7,697,351 rows (~7.7M), 1 partition: CPU times: user 5min 4s, sys: 10.6 s, total: 5min 14s; Wall time: 4min 52s
For comparison, I performed the same operation in Pandas on 7,690,000 (~7.7M) rows, and the time taken was: CPU times: user 502 µs, sys: 0 ns, total: 502 µs; Wall time: 402 µs.
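The Pandas side of the comparison was essentially just this (pdf stands for the same data already loaded into memory as a plain pandas DataFrame; the loading code is omitted):

# pdf is an in-memory pandas DataFrame of ~7.69M rows
%time pdf.shape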
I’m just trying to find the shape of the data, but in Dask, regardless of the type of operation, a single compute takes roughly the same amount of time.
May I know what the reason behind this is, and what I can do to avoid it?