I recently added docling as part of a dask pipeline. Docling is a document processor with its own thread pool, which might be causing some contention with dask. I limit it to 1/5th of the available cores, and aside from this one heavy multithreaded task the load on the workers is not very high.
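For reference, the cap is applied roughly like this (a simplified sketch; the exact docling option and module names may differ between docling versions, so treat the imports as illustrative rather than the actual code):

```python
import os

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AcceleratorOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Cap docling's internal thread pool to ~1/5 of the cores visible to the worker.
num_threads = max(1, (os.cpu_count() or 1) // 5)

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = AcceleratorOptions(num_threads=num_threads)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```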
My jobs started failing randomly with FutureCancelledError() and no clear cause. Client-side tracebacks:
[2025-01-21 20:50:02,698 ERROR service] worker 2: handling event FileCreate(timestamp=1737492574, service='file_processor', customer_id=ObjectId('6660a87e4f67b21052795d67'), ki_id=ObjectId('675c4430c2f1f87f8f1050da'), ki_media='file', kind='reuse_created'): docling_extract_and_split-fe71a502-edea-41cb-b7c8-9ce46137d437 cancelled for reason: unknown.
Traceback (most recent call last):
File "/backend/vascular/bin.runfiles/_main/backend/vascular/src/service.py", line 86, in run
await self.process_event(event)
File "/backend/vascular/bin.runfiles/_main/backend/vascular/src/service.py", line 119, in process_event
await future
File "/backend/vascular/bin.runfiles/rules_python~~pip~pypi_312_distributed_py3_none_any_0c75888b/site-packages/distributed/client.py", line 409, in _result
raise exc.with_traceback(tb)
File "/backend/dask/bin.runfiles/_main/backend/dask/src/event_defs.py", line 103, in handle
File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 106, in file_create
File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 209, in process_file
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 376, in compute
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 664, in compute
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_distributed_py3_none_any_0c75888b/site-packages/distributed/client.py", line 2414, in _gather
distributed.client.FutureCancelledError: docling_extract_and_split-fe71a502-edea-41cb-b7c8-9ce46137d437 cancelled for reason: unknown.
[2025-01-21 20:50:24,355 ERROR service] worker 3: handling event FileCreate(timestamp=1737492575, service='file_processor', customer_id=ObjectId('6660a87e4f67b21052795d67'), ki_id=ObjectId('6790085f9828b5848c8a061b'), ki_media='file', kind='upload_created'): write_data_to_mongo-46e60f1a-5a0a-4ead-abb5-b5329740ef53 cancelled for reason: unknown.
Traceback (most recent call last):
File "/backend/vascular/bin.runfiles/_main/backend/vascular/src/service.py", line 86, in run
await self.process_event(event)
File "/backend/vascular/bin.runfiles/_main/backend/vascular/src/service.py", line 119, in process_event
await future
File "/backend/vascular/bin.runfiles/rules_python~~pip~pypi_312_distributed_py3_none_any_0c75888b/site-packages/distributed/client.py", line 409, in _result
raise exc.with_traceback(tb)
File "/backend/dask/bin.runfiles/_main/backend/dask/src/event_defs.py", line 103, in handle
File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 106, in file_create
File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 237, in process_file
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 376, in compute
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 664, in compute
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_distributed_py3_none_any_0c75888b/site-packages/distributed/client.py", line 2414, in _gather
distributed.client.FutureCancelledError: write_data_to_mongo-46e60f1a-5a0a-4ead-abb5-b5329740ef53 cancelled for reason: unknown.
Note that these are two different tasks; the second one runs after docling.
Worker-side failures look like this (there is no other indication of a problem besides these messages in the middle of otherwise normal app logs):
[2025-01-21 20:52:33,586 INFO worker] starting task ask_vllm
[2025-01-21 20:52:33,589 DEBUG rs_appcore.redis_lock] priority 0: using single lock
[2025-01-21 20:52:33,590 DEBUG rs_appcore.redis_lock] locking s1_lock:vllm_completions:2/6790091136f4bd8193638c56
2025-01-21 20:52:33,598 - distributed.worker - ERROR - Compute Failed
Key: handle-5ec95939-9d5a-4ea2-9c05-ab8f4d944a27
State: long-running
Function: handle
args: ()
kwargs: {}
Exception: 'FutureCancelledError()'
Traceback: ' File "/backend/dask/bin.runfiles/_main/backend/dask/src/event_defs.py", line 103, in handle\n file_create(self)\n File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 106, in file_create\n ki, status = process_file(event.customer_id, event.ki_id)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 237, in process_file\n ).compute()\n ^^^^^^^^^\n File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 376, in compute\n (result,) = compute(self, traverse=False, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 664, in compute\n results = schedule(dsk, keys, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_distributed_py3_none_any_0c75888b/site-packages/distributed/client.py", line 2414, in _gather\n raise exception.with_traceback(traceback)\n'
[2025-01-21 20:52:33,634 INFO worker] starting task set_ki_status
[2025-01-21 20:52:33,635 DEBUG rs_vllm.completions] sending completions request temp=None seed=210990856
And finally, the human-readable tracebacks from the dashboard:
File "/backend/dask/bin.runfiles/_main/backend/dask/src/event_defs.py", line 103, in handle
file_create(self)
File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 106, in file_create
ki, status = process_file(event.customer_id, event.ki_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 237, in process_file
).compute()
^^^^^^^^^
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 376, in compute
(result,) = compute(self, traverse=False, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 664, in compute
results = schedule(dsk, keys, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_distributed_py3_none_any_0c75888b/site-packages/distributed/client.py", line 2414, in _gather
raise exception.with_traceback(traceback)
File "/backend/dask/bin.runfiles/_main/backend/dask/src/event_defs.py", line 103, in handle
file_create(self)
File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 106, in file_create
ki, status = process_file(event.customer_id, event.ki_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/backend/dask/bin.runfiles/_main/backend/dask/src/events/file.py", line 209, in process_file
text, segments = text_extractor.extract_and_split_text().compute()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 376, in compute
(result,) = compute(self, traverse=False, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_dask_py3_none_any_484c317e/site-packages/dask/base.py", line 664, in compute
results = schedule(dsk, keys, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/backend/dask/bin.runfiles/rules_python~~pip~pypi_312_distributed_py3_none_any_0c75888b/site-packages/distributed/client.py", line 2414, in _gather
raise exception.with_traceback(traceback)
None of these tracebacks touch app code; they only point at the compute() calls.
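The call structure the tracebacks point at is essentially a task running on a worker that submits a nested graph and waits on it. A minimal sketch of that shape (not the real app code; the names and the worker_client usage are illustrative):

```python
from dask import delayed
from distributed import Client, worker_client


@delayed
def extract_and_split(path):
    # Stand-in for the docling extraction step.
    return f"text from {path}", ["segment 1", "segment 2"]


def handle(path):
    # Runs as a task on a worker; seceding from the thread pool is why the
    # task shows up as "long-running" in the worker logs.
    with worker_client() as client:
        # The nested graph goes back to the scheduler. If the inner future
        # gets cancelled, waiting on it raises FutureCancelledError right
        # here, matching the distributed/client.py frames in the tracebacks.
        future = client.compute(extract_and_split(path))
        text, segments = future.result()
    return len(segments)


if __name__ == "__main__":
    with Client(processes=False) as client:
        print(client.submit(handle, "a.pdf").result())
```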
The scheduler logs don't mention anything about this. The only suspicious thing is a single message about the GIL being held for a few seconds, but the failures keep happening after that.
These errors happen early during processing, so they don't look like timeouts. Roughly 7% of pipelines are affected.
Are there any obvious steps I could take to debug this? Worker debug logs add nothing beyond heartbeats.
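(For context, "worker debug logs" here means the distributed worker loggers bumped to DEBUG, roughly the equivalent of this illustrative snippet; the scheduler address is a placeholder:)

```python
import logging

from distributed import Client


def enable_debug_logging(dask_worker=None):
    # Raise distributed's worker-side loggers to DEBUG on every worker.
    for name in ("distributed.worker", "distributed.core"):
        logging.getLogger(name).setLevel(logging.DEBUG)


client = Client("tcp://dask-scheduler:8786")  # placeholder address
client.run(enable_debug_logging)
```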
All the dask/distributed issues I could find about FutureCancelledError include a concrete cancellation reason, but in my case it's "unknown".
dask/distributed: 2025.1.0
env: k8s cluster, custom docker images for the workers and the scheduler