I thought I would make a post to highlight to the community a serious memory-cleanup bug that has been plaguing Polars since 1.27.1. When investigating, I first thought it was a Dask memory issue: the unmanaged memory on my workers wasn't releasing and just kept building until the workers paused, spilled, and terminated themselves, no matter how much RAM I threw at the problem.
The crux of the problem is that whenever you create a Polars DataFrame, the memory allocator never releases that memory back to the OS, so my unmanaged memory climbed indefinitely, no matter how much malloc trimming, object deleting, or garbage collecting I did. I was seeing this on Ubuntu images, FWIW.
In short:
- By default, jemalloc retains freed ("dirty"/"muzzy") pages for a decay period instead of returning them to the OS immediately.
- Your memory usage will plateau but never drop, even after you delete big DataFrames and call `gc.collect()`.
- You may observe a high resident set size (RSS) in `htop`, even when Python's heap is mostly empty.

That's why Polars seems to never release memory, even after garbage collection.
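If jemalloc really is the allocator holding the pages, its page-return behaviour is tunable at process startup. Here is a minimal launcher sketch, under the assumptions that the worker process actually runs under jemalloc, that `worker_script.py` stands in for your real entry point, and that `_RJEM_MALLOC_CONF` is the prefixed variant some vendored jemalloc builds read:

```python
import os
import subprocess
import sys

# Hypothetical launcher sketch: jemalloc reads its config once at startup,
# so the decay settings must be in the environment before the worker starts.
# dirty_decay_ms:0 / muzzy_decay_ms:0 tell jemalloc to hand freed pages back
# to the OS immediately instead of caching them for reuse.
env = dict(
    os.environ,
    MALLOC_CONF="dirty_decay_ms:0,muzzy_decay_ms:0",
    _RJEM_MALLOC_CONF="dirty_decay_ms:0,muzzy_decay_ms:0",
)

if __name__ == "__main__" and os.path.exists("worker_script.py"):
    # "worker_script.py" is a placeholder for your real worker / job entry point.
    subprocess.run([sys.executable, "worker_script.py"], env=env, check=True)
```

Disabling decay trades allocation speed for promptness of page return, so treat it as a diagnostic knob rather than a default.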
You can follow along here:
(Issue opened 09 Jun 2025, 04:27 AM UTC; labels: enhancement, closing-candidate.)
### Description
I work on the latest Ubuntu Linux (24.04.2 LTS, but previous versions suffer from the same) and the latest Polars (1.30).
I start with a DataFrame of about 10 GB and perform a lot of groupby, join, and math operations.
As a result, I end up with a 100 GB final DataFrame that is returned from the Python function that does all the processing.
However, the OS shows that 1 TB of RAM is used.
I found [this](https://stackoverflow.com/questions/76061800/polars-df-takes-up-lots-of-ram) explanation from Ritchie that "3x a table size is actually pretty good". However, not 10x, right? Especially if we are talking about 0.9 TB wasted.
And it only gets worse. If I proceed with computing more columns, this is the last row from my log file:
_**RSS=2766.6 GB**, sys RAM available=2543.5 GB, df size=**296.8 GB**._
And on the next operation the process simply OOMs, because the total RAM of the node is 2.9 TB.
The real data size is about 300 GB, though.
Nothing I tried changed anything: calling `malloc_trim(0)` from within my Python process, running the process under the jemalloc allocator; no effect whatsoever.
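For readers who want to reproduce the `malloc_trim(0)` attempt, this is roughly what calling it from Python via `ctypes` looks like. A sketch only: it does anything solely on glibc-based Linux, and as described above it had no effect in my case:

```python
import ctypes
import ctypes.util


def try_malloc_trim() -> bool:
    """Ask glibc to return free heap pages to the OS.

    Returns False where malloc_trim does not exist (e.g. macOS, Windows, musl).
    """
    libc_path = ctypes.util.find_library("c")
    if libc_path is None:
        return False
    libc = ctypes.CDLL(libc_path)
    # Missing symbol lookup raises AttributeError, which hasattr swallows.
    if not hasattr(libc, "malloc_trim"):
        return False
    libc.malloc_trim.argtypes = [ctypes.c_size_t]
    libc.malloc_trim.restype = ctypes.c_int
    # glibc returns 1 if any memory was actually released back to the OS.
    return bool(libc.malloc_trim(0))
```

Note this only trims the glibc heap; memory held by a different allocator compiled into an extension module is out of its reach, which matches what the quoted AI answer below argues.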
AI suggested this:
>
> Why malloc_trim(0) and jemalloc don’t work from Python with Polars
> ❌ Why malloc_trim(0) from Python doesn't work:
> Polars is written in Rust, and compiled Rust code by default uses libc::malloc, not Python's malloc or the heap that Python controls.
>
> So when you call malloc_trim(0) in Python, it only affects the memory allocated via Python’s own malloc (e.g., NumPy, lists, etc.), not memory allocated inside Polars (Rust).
>
> The memory bloat you’re seeing is likely entirely in Rust’s heap, managed separately.
>
> ❌ Why jemalloc via LD_PRELOAD didn't help:
> LD_PRELOAD only affects dynamically-linked binaries.
>
> Many Rust programs — including Polars and Arrow components — are compiled with --release and statically link to the default system allocator.
>
> So jemalloc never got used by the Rust part, even though you preloaded it.
> ✅ Why doing it from within Rust could help
> Yes — calling malloc_trim(0) or linking jemalloc from within the Rust side of Polars can work, but only if Polars is explicitly built to use jemalloc or invokes trimming from Rust
>
> 🔧 What would work (on the Rust side):
> ✅ 1. Polars compiled with jemalloc as global allocator
> Inside Polars (or Arrow, or any Rust dependency handling large memory):
>
> ```rust
> #[global_allocator]
> static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;
> ```
> ✅ Effect:
>
> jemalloc returns memory to OS much more effectively than glibc.
>
> This will solve fragmentation and memory hoarding issues if jemalloc is actually linked and used.
>
> ✅ 2. Call malloc_trim(0) from Rust directly
> If sticking with glibc, add in Rust:
>
> ```rust
> extern "C" {
>     fn malloc_trim(pad: usize) -> i32;
> }
>
> unsafe {
>     malloc_trim(0);
> }
> ```
> This will work because:
>
> It's executed from within the same allocator and heap as the Rust code that allocated the memory.
>
> But:
>
> glibc still doesn’t return memory unless fragmentation conditions are met.
>
> So this is hit-or-miss, unlike jemalloc which is much more consistent.
> ✅ Concrete Path Forward
> Best option:
> Have the Polars Rust code base expose an optional jemalloc feature, or at least support compiling with:
>
> ```bash
> RUSTFLAGS="-C target-cpu=native" cargo build --release --features jemalloc
> ```
> And inside Polars:
>
> ```rust
> #[cfg(feature = "jemalloc")]
> #[global_allocator]
> static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;
> ```
I'm not an expert in C, Rust, Linux kernels, or allocators, but this simply feels wrong. Please, if some of the measures described above can be taken, let's take them, at least optionally. For God's sake, please implement them. What kind of high-performance, big-data number-crunching engine is this if it handles memory so inefficiently?
Maybe there is some known workaround? I've asked on Discord, to no avail.
But I am sharing the workaround I found that works on Windows. You can simply call
```python
import ctypes
import ctypes.wintypes
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def trim_windows_process_memory(pid: Optional[int] = None) -> bool:
    """Causes an effect similar to malloc_trim on *nix."""
    # SIZE_T width depends on the platform (32-bit or 64-bit).
    if ctypes.sizeof(ctypes.c_void_p) == 4:
        SIZE_T = ctypes.c_uint32
    else:
        SIZE_T = ctypes.c_uint64
    # Get a (pseudo-)handle to the current process if none was given.
    if not pid:
        pid = ctypes.windll.kernel32.GetCurrentProcess()
    # Declare argument and return types for SetProcessWorkingSetSizeEx.
    ctypes.windll.kernel32.SetProcessWorkingSetSizeEx.argtypes = [
        ctypes.wintypes.HANDLE,  # process handle
        SIZE_T,                  # minimum working set size
        SIZE_T,                  # maximum working set size
        ctypes.wintypes.DWORD,   # flags
    ]
    ctypes.windll.kernel32.SetProcessWorkingSetSizeEx.restype = ctypes.wintypes.BOOL
    QUOTA_LIMITS_HARDWS_MIN_DISABLE = 0x00000002
    # Passing (SIZE_T)-1 for both sizes asks Windows to trim the working set.
    result = ctypes.windll.kernel32.SetProcessWorkingSetSizeEx(
        pid, SIZE_T(-1), SIZE_T(-1), QUOTA_LIMITS_HARDWS_MIN_DISABLE
    )
    if result == 0:
        error_code = ctypes.windll.kernel32.GetLastError()
        logger.error(f"SetProcessWorkingSetSizeEx failed with error code: {error_code}")
        return False
    return True
```
after heavy operations, along with `gc.collect()`; together they release unused RAM back to the OS with no problems. However, Windows cannot always be used, for a plethora of reasons. I hope to find an equally clean solution for Linux as well.
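Whatever workaround you try, it helps to measure whether RSS actually drops. A small helper I would use for that, as a sketch assuming a Linux-style `/proc` (the `resource` fallback reports the *peak* rather than current RSS and is absent on Windows):

```python
import os


def rss_bytes() -> int:
    """Current resident set size of this process, in bytes."""
    try:
        # /proc/self/statm: the second field is resident pages (Linux only).
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except (OSError, ValueError, IndexError):
        # Fallback: peak RSS via getrusage (ru_maxrss is in KiB on Linux).
        import resource
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024
```

Log `rss_bytes()` before and after `gc.collect()` plus whichever trim call you are testing; if the number does not move, the pages are still parked inside the allocator.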
Looks like it was a conscious decision made by the Polars community here:
https://github.com/pola-rs/polars/issues/18088#issuecomment-2277968519
Hi @elementace, welcome to the Dask community, and thanks for sharing!