Unmanaged memory from Polars / Jemalloc

I thought I would make a post to highlight to the community a serious memory cleanup bug that has been plaguing Polars since 1.27.1. When investigating, I first thought it was a Dask memory issue, as the unmanaged memory on my workers wasn’t releasing; it just kept building until my workers paused, spilled and terminated themselves (no matter how much RAM I threw at the problem).

The crux of the problem is that whenever you create a Polars DataFrame, the allocator never releases that memory back to the OS, so my unmanaged memory kept climbing no matter how much malloc clearing, object deleting or garbage collecting I did. I was seeing this on Ubuntu images, FWIW.
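
For reference, the "malloc clearing" mentioned above is usually done on Linux by calling glibc's `malloc_trim` through `ctypes`. Here is a minimal sketch, assuming a glibc-based image such as Ubuntu (the helper name `trim_memory` is my own). It only asks glibc's allocator to hand back free heap pages, so it would not be expected to touch memory held by a separate allocator such as jemalloc.

```python
import ctypes
import ctypes.util
import gc


def trim_memory():
    """Run a GC pass, then ask glibc to return free heap pages to the OS.

    Returns malloc_trim's result (1 if memory was released, 0 if not),
    or None when glibc's malloc_trim is not available on this platform.
    """
    gc.collect()
    libc_name = ctypes.util.find_library("c") or "libc.so.6"
    try:
        libc = ctypes.CDLL(libc_name)
        return int(libc.malloc_trim(0))
    except (OSError, AttributeError):
        # Not glibc (e.g. musl, macOS, Windows): nothing to trim here.
        return None
```

On a Dask cluster this can be run on every worker with `client.run(trim_memory)`, assuming `client` is a `distributed.Client`.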

As far as I can tell, jemalloc (the allocator Polars uses) never releases freed ("muzzy") memory back to the OS.
Memory usage plateaus but never drops, even after you delete big DataFrames and call gc.collect().
You may see a high resident set size (RSS) in htop even when Python's heap is mostly empty.
That is why Polars seems never to release memory, even after garbage collection.
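
A rough way to check the RSS behaviour described above is to sample it before an allocation, while the object is alive, and after `del` plus `gc.collect()`. This is only a sketch, assuming Linux (it reads `/proc/self/status`). The helper names `rss_mib` and `rss_around` are hypothetical, and the stand-in allocation is a plain Python list so it runs without Polars installed.

```python
import gc


def rss_mib():
    """Current resident set size of this process in MiB (Linux /proc only)."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024  # field is reported in kB
    raise RuntimeError("VmRSS not found in /proc/self/status")


def rss_around(make_object):
    """Return RSS (before, while the object is alive, after del + gc.collect())."""
    before = rss_mib()
    obj = make_object()
    alive = rss_mib()
    del obj
    gc.collect()
    after = rss_mib()
    return before, alive, after


if __name__ == "__main__":
    # Stand-in allocation; swap in a Polars DataFrame constructor to compare.
    before, alive, after = rss_around(lambda: list(range(5_000_000)))
    print(f"before={before:.0f} MiB  alive={alive:.0f} MiB  after={after:.0f} MiB")
```

To try it against Polars, pass something like `lambda: pl.DataFrame({"x": list(range(50_000_000))})` after `import polars as pl`, and compare the `after` value with `before`.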

You can follow along here:

Looks like it was a conscious decision made by the Polars community here:
https://github.com/pola-rs/polars/issues/18088#issuecomment-2277968519

Hi @elementace, welcome to the Dask community, and thanks for sharing!