LightGBM Distributed Training

Hi,

I am training a distributed LightGBM model using Dask. I have parquet data that I load with the read_parquet utility, and I then split the DataFrame to extract the features and the regression target. So I end up with two different DataFrames, let’s call them X and y.

When I go to fit my model, I run into the issue that X and y have different partitions, so LightGBM cannot be fit on them.

I can get the code to run by converting the DataFrame to a Dask Array using to_dask_array. Is this approach correct? I saw that XGBoost has the concept of a Dask DMatrix. I can’t find anything online, and I’m trying to understand whether I have a fundamental misunderstanding here or whether DataFrame-to-Array conversion is indeed the suggested route. I am also assuming that ordering is preserved within a partition, so the mapping between the features and the regression target is maintained.
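Roughly, my current code looks like the sketch below (the parquet path and column names are placeholders, and it assumes a dask.distributed Client is already connected):

```python
import dask.dataframe as dd
import lightgbm as lgb

# Placeholder path and column name, just to illustrate the workflow.
df = dd.read_parquet("path/to/data/")
y = df["target"]
X = df.drop(columns=["target"])

# Convert to Dask Arrays; lengths=True computes the chunk sizes,
# which are unknown after read_parquet.
X_arr = X.to_dask_array(lengths=True)
y_arr = y.to_dask_array(lengths=True)

# Assumes a dask.distributed Client is already running.
model = lgb.DaskLGBMRegressor()
model.fit(X_arr, y_arr)
```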

Hi @Matteo_Ciccozzi!

Could you provide an MCVE of your error with some random data? This would help a lot in understanding your problem.

This might well be the case, but if so, using a Dask Array shouldn’t make any difference, should it? However, I see in the LightGBM documentation:

While setting up for training, lightgbm will concatenate all of the partitions on a worker into a single dataset

So I’m not sure why aligned partitions should be needed. In any case, LightGBM should be compatible with both Dask Array and Dask DataFrame, but maybe the partitions need to be aligned. Did you ask the question on the LightGBM issue tracker?
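Something along these lines, with random data, would be enough (just a minimal sketch assuming DaskLGBMRegressor and a LocalCluster):

```python
import dask.array as da
from dask.distributed import Client, LocalCluster
from lightgbm import DaskLGBMRegressor

# Small local cluster and arbitrary random data, only to reproduce the issue.
cluster = LocalCluster(n_workers=2)
client = Client(cluster)

X = da.random.random((10_000, 10), chunks=(1_000, 10))
y = da.random.random(10_000, chunks=1_000)

model = DaskLGBMRegressor(n_estimators=10)
model.fit(X, y)
```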

Hi! Thank you for the reply. I was able to get this to work but I don’t quite understand why.
After loading the data, I indexed X and y by column name and then added a persist and a wait call. I think I understand in theory how this might help, but I haven’t been able to conclusively justify my reasoning yet.
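Roughly, what I did looks like this (path and column names are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client()  # assumes a cluster is already available

df = dd.read_parquet("path/to/data/")
X = df[["feature_1", "feature_2"]]  # placeholder feature columns
y = df["target"]                    # placeholder target column

# Persist both collections derived from the same source, then block
# until the partitions are actually materialized on the workers.
X, y = client.persist([X, y])
wait([X, y])
```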

Will come back to this once I have some time.


Sorry, coming back to this now, but I’m still seeing some strange behavior. Here is the MCVE: [LightGBM- DASK] Length of labels differs from the length of #data when working with parquet data. · Issue #6920 · microsoft/LightGBM · GitHub

Is it expected that I need to add the persist to align these? It feels a bit strange.

I’m also noticing that I run into a lot of memory issues when decompressing these parquet files in memory; I hit the Dask worker memory limits pretty quickly.

From a Dask point of view, this shouldn’t happen. As you say, this prevents the use of Dask and LightGBM with bigger-than-memory datasets.

I guess every worker only persists the partitions it is assigned, right?

Yes, but all the data will be loaded into distributed memory, spread among the workers.

But wouldn’t that be the case anyway? I’m assuming the LightGBM library also does something like that (causing the OOM issues when we have 2-3x the data in memory). If I have 20 workers, the data is spread between them, which should make it manageable in theory.

With Dask only, not necessarily. With LightGBM, I admit I don’t know…

Thanks, and sorry if I am being persistent. Could you help me understand what you mean when you say “With Dask only, not necessarily”? That part is a bit lost on me, and I’m new to Dask.

I meant that with Dask alone, if you don’t use persist, all the data does not have to be in memory at once. Dask is able to handle datasets bigger than memory in a streaming fashion if your algorithm is compatible.
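For example, a simple reduction like the one below can stream through the partitions one at a time, whereas the persisted version keeps the whole dataset in distributed memory (path and column name are placeholders):

```python
import dask.dataframe as dd

df = dd.read_parquet("path/to/data/")  # placeholder path

# Streaming: partitions are read, reduced, and released as the computation
# proceeds, so the full dataset never has to fit in memory at once.
mean_target = df["target"].mean().compute()

# Persisted: every partition is loaded into distributed memory and kept
# there, so the full dataset must fit across the workers.
df = df.persist()
```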