Error "bytes object is too large" when training LightGBM on a large dataset on a high-performance single host

Hi team, recently we have been trying to train a LightGBM model on a dataset of about 100 GiB on a high-performance machine with 100 cores and 400 GiB of RAM.
I used a local cluster and ran the code following this example (a sketch of the setup is below).
The versions of dask and lightgbm are:
dask 2023.5.0
lightgbm 3.3.5
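A minimal sketch of that cluster setup, assuming a LocalCluster sized for this machine (the worker count, thread count, and memory limit below are illustrative, not the exact values used):

```python
# Hypothetical LocalCluster setup on the single 100-core / 400 GiB host.
# Worker count, threads, and memory limit are illustrative values only.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=10,            # e.g. 10 workers x 10 threads on a 100-core machine
    threads_per_worker=10,
    memory_limit="40GiB",    # per worker, roughly 400 GiB in total
)
client = Client(cluster)
```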

And I got the error below. Any suggestions? Thank you very much.


/usr/local/lib/python3.8/site-packages/lightgbm/dask.py:526: UserWarning: Parameter n_jobs will be ignored.
  _log_warning(f"Parameter {param_alias} will be ignored.")
/usr/local/lib/python3.8/site-packages/lightgbm/dask.py:526: UserWarning: Parameter nthread will be ignored.
  _log_warning(f"Parameter {param_alias} will be ignored.")
/usr/local/lib/python3.8/site-packages/distributed/client.py:3108: UserWarning: Sending large graph of size 76.51 GiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(
2023-12-07 23:19:08,103 - distributed.protocol.core - CRITICAL - Failed to Serialize
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/protocol/core.py", line 109, in dumps
    frames[0] = msgpack.dumps(msg, default=_encode_default, use_bin_type=True)
  File "/usr/local/lib/python3.8/site-packages/msgpack/__init__.py", line 36, in packb
    return Packer(**kwargs).pack(o)
  File "msgpack/_packer.pyx", line 294, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 300, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 297, in msgpack._cmsgpack.Packer.pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 231, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 264, in msgpack._cmsgpack.Packer._pack
  File "msgpack/_packer.pyx", line 202, in msgpack._cmsgpack.Packer._pack
ValueError: bytes object is too large
2023-12-07 23:19:08,107 - distributed.comm.utils - ERROR - bytes object is too large

Hi @tongxin.wen, welcome to the Dask community,

How are you reading your input dataset? There’s a hint in the error message:

/usr/local/lib/python3.8/site-packages/distributed/client.py:3108: UserWarning: Sending large graph of size 76.51 GiB.
This may cause some slowdown.

I think you first read your data locally before trying to feed it into Dask or your model; instead, you should read your data in chunks directly on the Workers. msgpack cannot serialize objects larger than 4 GiB.
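Here is a rough sketch of the difference, assuming the input is a set of Parquet files and the lightgbm.dask estimators (the paths and column names are just examples):

```python
import dask.dataframe as dd
from lightgbm import DaskLGBMRegressor  # DaskLGBMClassifier works the same way

# Anti-pattern: reading the whole dataset in the client process first.
# The pandas data then gets embedded in the task graph, which has to be
# serialized (msgpack) and shipped to the scheduler/Workers; hence the
# "Sending large graph of size 76.51 GiB" warning and the msgpack failure.
#
#   import pandas as pd
#   pdf = pd.read_parquet("train_data/")          # ~100 GiB in client memory
#   ddf = dd.from_pandas(pdf, npartitions=100)    # data ends up in the graph
#
# Instead, describe the read lazily so each Worker loads its own chunks:
# the graph then only contains small read tasks, not the data itself.
ddf = dd.read_parquet("train_data/*.parquet")
X = ddf.drop(columns=["label"])
y = ddf["label"]

model = DaskLGBMRegressor(n_estimators=100)
model.fit(X, y)
```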

Thanks @guillaumeeb,
I also noticed this message.
At first my plan was to run it on the local host and then extend the job to run on a cluster.
But now it seems I have to run it on a cluster.

I’m not sure what you mean. The important point is to read your input data through the Workers directly; that applies whether you use a LocalCluster on a single host or a distributed cluster.