Consider an SSHCluster of 58 workers on 8 nodes, where each worker is created as an actor. At some stage, actors send portions of their data to each other (consider it state information synchronization).
I am having a problem where, after some time, the program stops abruptly, but I have no error logs whatsoever. I do know, however, that it most likely happened while actor 30 was sending its data to the other workers.
I have a “send_results” method and a “receive_results” method:
- the send_results method is called from the client “n” times, where “n” is the number of workers;
- each send_results call invokes the receive_results method “n-1” times, i.e. once for every worker apart from itself.
(This might be inefficient; in fact, each worker spends an average of 3 seconds sending its data to the other workers. It was necessary, however, because Dask actors are single-threaded: an actor cannot send data and wait for a response, i.e. completion, while at the same time processing data coming in from another actor.)
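To make the call pattern concrete, here is a minimal sketch in plain Python. The `Worker` class and `run_round` helper are hypothetical stand-ins, not Dask actors and not my actual code; they only count how many receive_results calls one synchronization round produces.

```python
class Worker:
    """Plain-Python stand-in for one actor (NOT a Dask Actor; illustration only)."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.received = []  # list of (sender_id, payload) tuples

    def receive_results(self, sender_id, payload):
        # In the real system this would process the incoming state.
        self.received.append((sender_id, payload))

    def send_results(self, peers):
        # Push this worker's state to every peer except itself.
        payload = f"state-from-{self.worker_id}"
        calls = 0
        for peer in peers:
            if peer is self:
                continue
            peer.receive_results(self.worker_id, payload)
            calls += 1
        return calls


def run_round(n):
    """Hypothetical driver: the 'client' calls send_results once per worker."""
    workers = [Worker(i) for i in range(n)]
    total_calls = sum(w.send_results(workers) for w in workers)
    return workers, total_calls


workers, total_calls = run_round(58)
print(total_calls)  # n * (n - 1) receive_results calls for n = 58
```

With 58 workers, one round is 58 × 57 = 3306 individual transfers, which is part of why I suspect the network.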
Both the send and receive methods are wrapped in a try/except block that looks something like this:
    try:
        # logic goes here
    except Exception as e:
        with open(stack_trace_log_file_name, "w") as fi:
            traceback.print_exc(file=fi)
        raise
In my opinion, this should always log to a local file and also propagate up the call chain:
- if receive_results fails, it would log an error on the local actor on which it crashed;
- the error would then propagate to send_results, which would in turn log an error on the actor that called that remote instance of receive_results;
- finally, it would propagate to the client. The client has a similar mechanism for logging any error generically, except that it does not raise again; it simply logs any caught error to file.
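The chain above can be sketched as a small decorator. This is a minimal illustration under stated assumptions, not my production code: `log_and_reraise` and `LOG_PATH` are hypothetical names, the two functions are plain Python rather than actor methods, and the file is opened in append mode ("a") instead of "w" so that a later failure does not overwrite an earlier trace.

```python
import functools
import os
import tempfile
import traceback
from datetime import datetime


def log_and_reraise(log_path):
    """Hypothetical helper: append the traceback to log_path, then re-raise."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Append mode keeps earlier traces from the same process.
                with open(log_path, "a") as fi:
                    stamp = datetime.now().isoformat()
                    fi.write(f"--- {stamp} pid={os.getpid()} in {func.__name__}\n")
                    traceback.print_exc(file=fi)
                raise  # propagate to the caller unchanged
        return wrapper
    return decorator


# Assumed log location: one file per process in the temp directory.
LOG_PATH = os.path.join(tempfile.gettempdir(), f"actor_trace_{os.getpid()}.log")


@log_and_reraise(LOG_PATH)
def receive_results(payload):
    if payload is None:
        raise ValueError("empty payload")
    return len(payload)


@log_and_reraise(LOG_PATH)
def send_results(payload):
    # Stands in for the remote call to another actor's receive_results.
    return receive_results(payload)
```

Calling `send_results(None)` here writes two entries to the file (one from each level) and still raises `ValueError` to the caller. Note that this only covers Python exceptions: if the worker process is killed from outside (for example by the operating system when memory runs out), the except block never runs and nothing is written.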
Is there anything wrong with the error handling in this case? Does the absence of any error logs make a network overload more likely? This is my only plausible explanation. Also, if it wasn’t a network overload, what else could it be? Without any error logs I feel quite lost.