I am training a distributed LightGBM (lgbm) model using Dask. I have parquet data that I load with the read_parquet utility, and I then split the resulting DataFrame to separate the features from the regression target. So I end up with two different dataframes, let's call them X and y.
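Roughly, the setup looks like this (the path and target column name are illustrative, not my real ones):

```python
import dask.dataframe as dd

# Load the training data from parquet (path is made up)
df = dd.read_parquet("data/*.parquet")

# Split off the regression target from the features
y = df["target"]                   # "target" is an illustrative column name
X = df.drop(columns=["target"])
```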
When I go to fit the model, I run into the issue that X and y have different partitions, and so lgbm cannot be fit on them.
I can get the code to run by converting each DataFrame to a Dask Array using to_dask_array. Is this approach correct? I saw that XGBoost has the concept of a Dask matrix (DaskDMatrix), but I can't find anything equivalent online for LightGBM, and I'm trying to understand whether I have a fundamental mismisunderstanding here or whether DataFrame-to-Array conversion is indeed the suggested route. I am also assuming that ordering is preserved within a partition, so the mapping between features and regression target is maintained.
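For reference, the workaround that runs looks roughly like this (a minimal sketch; the cluster setup and the lengths=True argument are how I got it working, not necessarily the recommended way):

```python
from dask.distributed import Client
from lightgbm import DaskLGBMRegressor

client = Client()  # cluster setup omitted; normally I connect to a scheduler

# Continuing from X and y above: converting to Dask Arrays makes the fit
# run. lengths=True computes each partition's length so the resulting
# arrays have known chunk sizes.
X_arr = X.to_dask_array(lengths=True)
y_arr = y.to_dask_array(lengths=True)

model = DaskLGBMRegressor()
model.fit(X_arr, y_arr)
```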
Could you provide an MCVE of your error with some random data? That would help a lot in understanding your problem.
This might well be the case, but if so, using a Dask Array shouldn't make any difference, should it? However, I see in the LightGBM docs:
> While setting up for training, lightgbm will concatenate all of the partitions on a worker into a single dataset
So I'm not sure why aligned partitions should be needed. In any case, LightGBM should be compatible with both Dask Array and Dask DataFrame, though maybe X and y need to be aligned. Did you ask the question on the LightGBM issue tracker?
Hi! Thank you for the reply. I was able to get this to work, but I don't quite understand why.
After loading the data, I indexed X and y by column name and then added a persist and a wait call. I think I understand in theory how this might help, but I haven't been able to conclusively justify my reasoning yet.
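Concretely, something like this (a sketch; the path and column name are made up as before):

```python
import dask
import dask.dataframe as dd
from dask.distributed import wait

df = dd.read_parquet("data/*.parquet")

# Select target and features by column name
y = df["target"]                   # illustrative column name
X = df.drop(columns=["target"])

# Persist both collections together so they are materialized on the
# cluster, then block until all partitions have actually been computed
X, y = dask.persist(X, y)
wait([X, y])
```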
Is it expected that I need to add the persist to align these? It feels a bit strange.
I'm also noticing a lot of memory issues when these parquet files are decompressed in memory; I hit the Dask worker memory limits pretty quickly.
But wouldn't that be the case anyway? I'm assuming the lightgbm library also does something like that (causing the OOM issues when we end up with 2-3x the data in memory). If I have 20 workers, the data is spread between them, which should make it manageable in theory.
Thanks, and sorry if I am being persistent. Could you help me understand what you mean when you say “With Dask only, not necessarily”? That part is a bit lost on me, and I'm new to Dask.
I meant that with Dask alone, if you don't use persist, all the data does not have to be in memory at once. Dask can handle datasets bigger than memory in a streaming fashion, provided your algorithm is compatible with that.
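For example, a simple reduction can stream through data larger than the cluster's memory, because each worker only holds a few partitions at a time (the path is illustrative):

```python
import dask.dataframe as dd

df = dd.read_parquet("data/*.parquet")

# Without persist, each partition is read, reduced, and released as the
# computation streams through it; the whole dataset never has to fit in
# memory at once
column_means = df.mean().compute()
```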