LightGBM Distributed Training

Hi,

I am training a distributed LightGBM model using Dask. I have parquet data, which I load with the read_parquet utility. I then split the resulting dataframe to separate the features from the regression target, so I end up with two different dataframes, let's call them X and y.
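Roughly, my loading code looks like this (the path and the target column name are placeholders for my real data):

```python
import dask.dataframe as dd

# Placeholder path and column name; the real dataset has many feature columns
df = dd.read_parquet("s3://bucket/data/*.parquet")
X = df.drop(columns=["target"])
y = df["target"]
```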

When I go to fit my model, I run into the issue that X and y have different partitions, so LightGBM cannot be fit on them.

I can get the code to run by converting the DataFrames to Dask arrays using to_dask_array. Is this approach correct? I saw that XGBoost has the concept of a DaskDMatrix, but I can't find anything equivalent online for LightGBM, and I'm trying to understand whether I have a fundamental misunderstanding here or whether the dataframe-to-array conversion is indeed the suggested route. I am also assuming that ordering is preserved within a partition, so the mapping between features and regression target is maintained.
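For reference, here is a sketch of the conversion that makes it run. I am not sure lengths=True is strictly required, but it computes each partition's length so the resulting arrays have known chunk sizes:

```python
import lightgbm as lgb
from distributed import Client

client = Client()  # a local cluster, just for illustration

# Convert both collections to Dask arrays with known chunk sizes
X_arr = X.to_dask_array(lengths=True)
y_arr = y.to_dask_array(lengths=True)

model = lgb.DaskLGBMRegressor()
model.fit(X_arr, y_arr)
```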

Hi @Matteo_Ciccozzi!

Could you provide an MCVE of your error with some random data? This would help a lot in understanding your problem.
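For example, something along these lines that reproduces the partition mismatch would be ideal (the column names and partition counts here are just placeholders):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"a": range(100), "b": range(100), "target": range(100)})

# Deliberately mismatched partition counts to mimic the reported error
X = dd.from_pandas(pdf[["a", "b"]], npartitions=4)
y = dd.from_pandas(pdf["target"], npartitions=2)
```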

This might well be the case, but if so, using a Dask Array shouldn't make any difference. However, I see from the LightGBM doc:

While setting up for training, lightgbm will concatenate all of the partitions on a worker into a single dataset

So I'm not sure why aligned partitions would be needed. In any case, LightGBM should be compatible with both Dask Array and Dask DataFrame, but maybe the partitions do need to be aligned. Did you ask the question on the LightGBM issue tracker?
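If misaligned partitions really are the problem, one thing you could try is repartitioning y to X's divisions (this assumes X has known divisions):

```python
# Quick check for a mismatch, then align y's partitions to X's
print(X.npartitions, y.npartitions)
y = y.repartition(divisions=X.divisions)
```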

Hi! Thank you for the reply. I was able to get this to work, but I don't quite understand why.
After loading the data, I indexed X and y based on the column names, and then added a persist and a wait call. I think I understand in theory how this might help, but I haven't been able to conclusively justify my reasoning yet.
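Simplified, what I did looks roughly like this (the column names are placeholders, and client and df are as in my earlier snippets):

```python
from distributed import wait

# Select features and target by column name from the same dataframe
X = df[["feature_1", "feature_2"]]
y = df["target"]

# Materialize the partitions on the workers and block until they are in memory
X, y = client.persist([X, y])
wait([X, y])
```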

I will come back to this once I have some time.