Describe the issue:
Calling Numba-wrapped functions through map_overlap can cause a core dump; when run in a Jupyter notebook, the kernel dies. The crash is associated with NumPy structured arrays: calling the same functions with a plain NumPy array seems fine. The issue does not occur on every run, but re-running 3-5 times usually reproduces it. The error output is either:
free(): invalid pointer
Aborted (core dumped)
or a segmentation fault.
Interestingly, if map_overlap() is changed to map_partitions(), the error does not occur.
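For reference, assuming the ddf and call_numba_func defined in the MCVE below, the only difference between the failing and the passing variant is the call on the Dask DataFrame:

# crashes intermittently (core dump / dead kernel):
ddf.map_overlap(call_numba_func, 0, 0).head()
# completes reliably:
ddf.map_partitions(call_numba_func).head()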
Minimal Complete Verifiable Example:
import numpy as np
import pandas as pd
from numba import njit
from distributed import Client, LocalCluster
import dask.dataframe as dd


@njit(nogil=True)
def other_numba_func(array, idx, val):
    # Write into the 'Y' field of the structured array.
    array['Y'][idx] = val


@njit(nogil=True)
def numba_func(arr, params, array, other_numba_func):
    # Copy the first 'Z' value from the input records into the output array.
    other_numba_func(array, 0, arr['Z'][0])
    return array


def get_structured_array(num_rows):
    # Zero-initialized structured array with four float64 fields.
    dtype = [
        ('X', np.float64),
        ('A', np.float64),
        ('B', np.float64),
        ('Y', np.float64),
    ]
    return np.zeros(num_rows, dtype=dtype)


def call_numba_func(df: pd.DataFrame, **kwargs):
    params = pd.DataFrame(kwargs, index=[0]).to_records()[0]
    array = get_structured_array(df.shape[0])
    # df.to_records() returns a structured (record) array, which is
    # what the crash is associated with.
    array = numba_func(df.to_records(), params, array, other_numba_func)
    df['X'] = array['X'].astype(np.float32)
    return df


if __name__ == '__main__':
    client = Client(
        LocalCluster(n_workers=1, threads_per_worker=8, dashboard_address=18787),
        set_as_default=True,
    )
    df = pd.DataFrame({'Z': list(range(1000))})
    ddf = dd.from_pandas(df, npartitions=2)
    # Crashes intermittently; map_partitions(call_numba_func) does not.
    ddf.map_overlap(call_numba_func, 0, 0).head()
    client.shutdown()
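For comparison, as noted above, plain (non-structured) NumPy arrays seem fine. Here is a minimal sketch of that variant, with plain_numba_func and call_plain_func as hypothetical names introduced only for illustration:

@njit(nogil=True)
def plain_numba_func(z, out):
    # Same write pattern as the structured-array version, but on
    # plain 1-D float64 arrays instead of record arrays.
    out[0] = z[0]
    return out

def call_plain_func(df: pd.DataFrame, **kwargs):
    out = np.zeros(df.shape[0], dtype=np.float64)
    out = plain_numba_func(df['Z'].to_numpy(np.float64), out)
    df['X'] = out.astype(np.float32)
    return df

# Substituting this into the MCVE above has not triggered the crash:
ddf.map_overlap(call_plain_func, 0, 0).head()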
Environment:
Python 3.11.10
dask 2025.1.0 pypi_0 pypi
dask-cloudprovider 2024.9.1 pypi_0 pypi
dask-expr 1.1.20 pyhd8ed1ab_0 conda-forge
dask-labextension 7.0.0 pyhd8ed1ab_0 conda-forge
numba 0.61.0 pypi_0 pypi
numpy 2.1.0 pypi_0 pypi
pandas 2.2.3 py311h7db5c69_1 conda-forge
OS: Ubuntu 24.04.1 LTS (24.04, noble)