I’m evaluating Dask for some feature engineering tasks on a single node (not a cluster) with 4 CPUs and 16GB RAM. My datasets are 1GB and 3GB in size.
Unfortunately, I'm seeing Pandas and Polars outperform Dask significantly.
My question: are there any best practices for optimizing Dask for operations like group-by, binning, or missing-value imputation? I want to make sure I haven't missed anything in my setup.
Example:
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)  # attach a Client so compute() actually runs on the LocalCluster
transactions = dd.read_csv(
    'transactions.csv',
    parse_dates=['order_date', 'purchase_date', 'shipping_date',
                 'estimated_delivery_date', 'delivery_date', 'invoice_date'],
    dtype={'order_id': 'Int32', 'product_id': 'Int32', 'customer_id': 'Int32',
           'payment_type': 'category', 'order_status': 'category', 'invoice_no': 'Int32'})
transactions['delivery_date_vs_estimation'] = transactions.delivery_date - transactions.estimated_delivery_date
transactions.groupby('product_id').delivery_date_vs_estimation.mean().compute()  # mean deviation per product
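For reference, the Pandas side of the comparison is essentially the same pipeline. This is only a sketch of the operations being compared, not my exact benchmark code:
import pandas as pd
df = pd.read_csv('transactions.csv',
                 parse_dates=['order_date', 'purchase_date', 'shipping_date',
                              'estimated_delivery_date', 'delivery_date', 'invoice_date'],
                 dtype={'order_id': 'Int32', 'product_id': 'Int32', 'customer_id': 'Int32',
                        'payment_type': 'category', 'order_status': 'category', 'invoice_no': 'Int32'})
df['delivery_date_vs_estimation'] = df.delivery_date - df.estimated_delivery_date
df.groupby('product_id').delivery_date_vs_estimation.mean()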
The datatype of product_id is Int32, and delivery_date holds datetime64[ns] values.
This takes approx. 55 seconds with Dask, 5 seconds with Polars, and less than 1 second with Pandas. I verified that the LocalCluster uses all cores of my machine, and I also changed the blocksize for the CSV read, but that didn't make a noticeable difference (a rough sketch of what I tried is below).
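For completeness, this is roughly how I checked the worker setup and varied the blocksize, written as a self-contained variant of the snippet above. The explicit n_workers/threads_per_worker arguments and the '64MB' value are illustrative rather than my exact settings:
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# sanity check: one worker per core on the 4-CPU machine
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
print(client.scheduler_info()['workers'])  # lists the 4 workers and their thread counts
print(client.dashboard_link)               # the dashboard's task stream shows all workers busy

# re-read the CSV with an explicit blocksize ('64MB' is just one of the values tried)
transactions = dd.read_csv('transactions.csv', blocksize='64MB',
                           parse_dates=['order_date', 'purchase_date', 'shipping_date',
                                        'estimated_delivery_date', 'delivery_date', 'invoice_date'],
                           dtype={'order_id': 'Int32', 'product_id': 'Int32', 'customer_id': 'Int32',
                                  'payment_type': 'category', 'order_status': 'category',
                                  'invoice_no': 'Int32'})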
Any suggestions appreciated!