I am training a distributed LightGBM (lgbm) model using Dask. I have parquet data that I load with the read_parquet utility, and I then split the resulting DataFrame to separate the features from the regression target. So I end up with two different dataframes, let's call them X and y.
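Roughly, the setup looks like this (the path and target column name are illustrative, not my real ones):

```python
import dask.dataframe as dd

# Load the training data from parquet (path is made up)
df = dd.read_parquet("data/*.parquet")

# Split off the regression target from the features
y = df["target"]                   # "target" is an illustrative column name
X = df.drop(columns=["target"])
```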
When I go to fit the model, I run into the issue that X and y have different partitions, and so lgbm cannot be fit on them.
I can get the code to run by converting each DataFrame to a Dask Array using to_dask_array. Is this approach correct? I saw that XGBoost has the concept of a Dask matrix (DaskDMatrix), but I can't find anything equivalent online for LightGBM, and I'm trying to understand whether I have a fundamental mismisunderstanding here or whether DataFrame-to-Array conversion is indeed the suggested route. I am also assuming that ordering is preserved within a partition, so the mapping between features and regression target is maintained.
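For reference, the workaround that runs looks roughly like this (a minimal sketch; the cluster setup and the lengths=True argument are how I got it working, not necessarily the recommended way):

```python
from dask.distributed import Client
from lightgbm import DaskLGBMRegressor

client = Client()  # cluster setup omitted; normally I connect to a scheduler

# Continuing from X and y above: converting to Dask Arrays makes the fit
# run. lengths=True computes each partition's length so the resulting
# arrays have known chunk sizes.
X_arr = X.to_dask_array(lengths=True)
y_arr = y.to_dask_array(lengths=True)

model = DaskLGBMRegressor()
model.fit(X_arr, y_arr)
```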
Could you provide an MCVE of your error with some random data? That would help a lot in understanding your problem.
This might well be the case, but if so, using a Dask Array shouldn't make any difference, should it? However, I see in the LightGBM docs:
> While setting up for training, lightgbm will concatenate all of the partitions on a worker into a single dataset
So I'm not sure why aligned partitions should be needed. In any case, LightGBM should be compatible with both Dask Array and Dask DataFrame, though maybe X and y need to be aligned. Did you ask the question on the LightGBM issue tracker?
Hi! Thank you for the reply. I was able to get this to work, but I don't quite understand why.
After loading the data, I indexed X and y by column name and then added a persist and a wait call. I think I understand in theory how this might help, but I haven't been able to conclusively justify my reasoning yet.
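Concretely, something like this (a sketch; the path and column name are made up as before):

```python
import dask
import dask.dataframe as dd
from dask.distributed import wait

df = dd.read_parquet("data/*.parquet")

# Select target and features by column name
y = df["target"]                   # illustrative column name
X = df.drop(columns=["target"])

# Persist both collections together so they are materialized on the
# cluster, then block until all partitions have actually been computed
X, y = dask.persist(X, y)
wait([X, y])
```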
Is it expected that I need to add the persist to align these? It feels a bit strange.
I'm also noticing a lot of memory issues when these parquet files are decompressed in memory; I hit the Dask worker memory limits pretty quickly.
But wouldn't that be the case anyway? I'm assuming the lightgbm library also does something like that (causing the OOM issues when we end up with 2-3x the data in memory). If I have 20 workers, the data is spread between them, which should make it manageable in theory.
Thanks, and sorry if I am being persistent. Could you help me understand what you mean when you say “With Dask only, not necessarily”? That part is a bit lost on me, and I'm new to Dask.
I meant that with Dask alone, if you don't use persist, all the data does not have to be in memory at once. Dask can handle datasets bigger than memory in a streaming fashion, provided your algorithm is compatible with that.
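For example, a simple reduction can stream through data larger than the cluster's memory, because each worker only holds a few partitions at a time (the path is illustrative):

```python
import dask.dataframe as dd

df = dd.read_parquet("data/*.parquet")

# Without persist, each partition is read, reduced, and released as the
# computation streams through it; the whole dataset never has to fit in
# memory at once
column_means = df.mean().compute()
```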