Segmentation Fault due to Side Effect

This is not a good example for demonstrating the triggered segmentation fault, which produces no useful log at all, but it still shows the side effect on each worker that executes tasks.

The following code demonstrates the side effect, which is setting internal state of the worker so that warnings are raised as exceptions instead of being shown. From my point of view, since there are no code dependencies between different delayed tasks, they should not share any state.

In my segmentation fault case, I did the same thing, but the fault arises right after the restart. I think it is possible that the side effect persists across worker restarts and causes the memory access violation. I attached a shareable log at the end of the post. I cannot share the exact code, but I am able to avoid triggering the segmentation fault by removing the warnings code.
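The shared-state mechanism itself can be shown with plain Python, no Dask involved: `warnings.filterwarnings("error")` mutates the interpreter-wide filter list, so a logically independent function that runs afterwards in the same process observes the change. A minimal stand-alone sketch (task names are illustrative):

```python
import warnings

def task_a():
    # Mutates the process-wide warnings filter list: every warning
    # issued later in this process is raised as an exception.
    warnings.filterwarnings("error")

def task_b():
    # Logically independent task, but it observes task_a's side effect.
    warnings.warn("just a heads-up")

task_a()
try:
    task_b()
    leaked = False
except UserWarning:
    leaked = True

print("leaked:", leaked)  # leaked: True
```

This is exactly what happens inside a Dask worker process: the filter list is per-process state, not per-task state.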

PoC code

import time
import random

import dask
from dask.distributed import Client

def capture_exceptions(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print(e)
            return None
    return wrapper

@capture_exceptions
def get_delayed_side_effect(x: int) -> int:
    import warnings
    warnings.warn("Side effect is being applied")
    warnings.filterwarnings("error")
    time.sleep(random.random())
    print(x)
    return x

def main():
    client = Client("tcp://localhost:8786")

    for outer in range(3):
        l_delayed = []
        for i in range(10):
            l_delayed.append(dask.delayed(get_delayed_side_effect)(i))
        
        futures = client.compute(l_delayed)
        for future in futures:
            print(future.result())
        client.restart()

if __name__ == "__main__":
    main()

Actual worker output at segmentation fault

2025-11-18 22:43:37,336 - distributed.nanny - INFO - Worker process 1257699 was killed by signal 11
2025-11-18 22:43:37,386 - distributed.nanny - WARNING - Restarting worker
2025-11-18 22:43:37,569 - distributed.nanny - INFO - Worker process 1257707 was killed by signal 11
2025-11-18 22:43:37,586 - distributed.nanny - WARNING - Restarting worker
2025-11-18 22:43:37,792 - distributed.nanny - INFO - Worker process 1257724 was killed by signal 11
2025-11-18 22:43:37,809 - distributed.nanny - WARNING - Restarting worker
2025-11-18 22:43:37,992 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-91grg2rk', purging
2025-11-18 22:43:37,994 - distributed.nanny - INFO - Worker process 1257680 was killed by signal 11
2025-11-18 22:43:38,006 - distributed.nanny - WARNING - Restarting worker
2025-11-18 22:43:38,194 - distributed.nanny - INFO - Worker process 1257738 was killed by signal 11

Client-side error at the segmentation fault

distributed.scheduler.KilledWorker: Attempted to run task 'a-function-name-2f968de0d0af6189cdef9f8af9309313' on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://192.168.1.1:39875. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.

Hi @Karl_Han, welcome to Dask community,

I read and tried your code, but I’m not sure what to expect… I don’t see any segmentation fault. I’m not sure what state you are referring to…

Thanks for pointing that out. I will change my statement to make it clear, since this example is not the real code that triggers the segmentation fault. However, the idea is similar, because the ability to set global state can lead to accessing freed memory.

I also changed the code a little to demonstrate the values and the exception handling better across different restarts.

Setup and Code

  • Setup: default scheduler + 2 workers
  • Code: evaluate 10 delayed tasks, then restart, three times in total; the tasks try to print their int values, ranging from 0 to 29
# uv run dask scheduler
# uv run dask worker tcp://127.0.0.1:8786 --nthreads 1 --nworkers 2
import time
import random

import dask
from dask.distributed import Client

def capture_exceptions(func):
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            print(e)
            return None
    return wrapper

@capture_exceptions
def get_delayed_side_effect(x: int) -> int:
    import warnings
    try:
        warnings.warn("Side effect is being applied")
        warnings.filterwarnings("error")
        time.sleep(random.random())
        print(x)
        return x
    except Exception as e:
        print("Exception caught")
        return None

def main():
    import warnings
    # Suppress a specific warning category
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    client = Client("tcp://localhost:8786")

    for outer in range(3):
        l_delayed = []
        for i in range(10):
            l_delayed.append(dask.delayed(get_delayed_side_effect)(i + outer * 10))
        
        futures = client.compute(l_delayed)
        for future in futures:
            print(future.result())
        client.restart()

if __name__ == "__main__":
    main()

Expectation and Reality

  • Expectation: each task prints its value without throwing an error, since they are separate tasks and should not affect each other through shared state.
  • Reality: the warnings filter converts warnings to errors in the global state of each worker; therefore, within the same restart run (same outer value), only two of the ten tasks print their values (the first task on each of the two workers), while the others throw exceptions because the workers' global state has already been set.
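For completeness (not part of the original post): in this particular PoC the leak could be kept task-local with `warnings.catch_warnings()`, which snapshots the global filter list on entry and restores it on exit:

```python
import warnings

def get_delayed_side_effect(x: int) -> int:
    # catch_warnings() saves the global filter list on entry and
    # restores it on exit, so the "error" filter stays task-local.
    with warnings.catch_warnings():
        warnings.warn("Side effect is being applied")
        warnings.filterwarnings("error")
        return x

value = get_delayed_side_effect(1)

# After the context exits, the default filters are back: this warning
# is shown on stderr, not raised as an exception.
warnings.warn("a later, unrelated warning")
print(value)  # 1
```

Note that `catch_warnings` manipulates the same global state and is documented as not thread-safe, so with multi-threaded workers this narrows the race window rather than eliminating it.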

Output from console:

None
None
None
None
None
None
6
None
None
9
None
None
12
None
None
None
None
None
None
19
20
None
None
23
None
None
None
None
None
None

Output from workers (truncated some of the restart log)

2025-12-01 17:43:34,481 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:35727'
...
2025-12-01 17:43:35,544 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:8786
/examples/dask_warning_sideeffect.py:20: UserWarning: Side effect is being applied
  warnings.warn("Side effect is being applied")
/examples/dask_warning_sideeffect.py:20: UserWarning: Side effect is being applied
  warnings.warn("Side effect is being applied")
9
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
6
2025-12-01 17:43:38,689 - distributed.nanny - INFO - Nanny asking worker to close. Reason: client-restart-1764611018.6873732
...
2025-12-01 17:43:39,080 - distributed.worker - INFO - -------------------------------------------------
2025-12-01 17:43:39,080 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:8786
/examples/dask_warning_sideeffect.py:20: UserWarning: Side effect is being applied
  warnings.warn("Side effect is being applied")
/examples/dask_warning_sideeffect.py:20: UserWarning: Side effect is being applied
  warnings.warn("Side effect is being applied")
12
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
19
Exception caught
2025-12-01 17:43:39,982 - distributed.nanny - INFO - Nanny asking worker to close. Reason: client-restart-1764611019.9810302
...
2025-12-01 17:43:40,373 - distributed.worker - INFO - -------------------------------------------------
2025-12-01 17:43:40,374 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:8786
/examples/dask_warning_sideeffect.py:20: UserWarning: Side effect is being applied
  warnings.warn("Side effect is being applied")
/examples/dask_warning_sideeffect.py:20: UserWarning: Side effect is being applied
  warnings.warn("Side effect is being applied")
23
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
Exception caught
20
Exception caught
2025-12-01 17:43:40,794 - distributed.nanny - INFO - Nanny asking worker to close. Reason: client-restart-1764611020.7929833
...
2025-12-01 17:43:41,149 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:8786

Thanks for clarifying!

As much as I understand your point of view, since Dask and its workers are pure Python, and tasks run in Python threads, this is to be expected. There is no default mechanism to avoid that.

One thing you could try is to change the executor of the task, using for example

 with dask.annotate(executor="processes"):

Not sure it will be enough though…

See https://www.youtube.com/watch?v=vF2VItVU5zg&t=467s.