Handling Compute-Node Failures in FLESnet

Salem, F., F. Schintke, and A. Reinefeld, "Handling Compute-Node Failures in FLESnet", CBM Progress Report 2019, pp. 167–168, 2020.


FLESnet should tolerate failures of individual compute nodes and continue to distribute and build timeslices without unduly degrading the aggregate throughput and latency achieved in large-scale deployments. This requires all input nodes to switch consistently and in time to a new schedule and distribution scheme, and it might require redistributing timeslices that have not yet been entirely built. We implemented such a fault-tolerance mechanism in our Data-Flow Scheduler: it detects failed nodes and dynamically (re-)assigns their load to the remaining compute nodes. We show the influence of failures on the aggregate bandwidth and latency of timeslice building on our OmniPath testbed.
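
The reassignment idea can be illustrated with a minimal sketch. It is not the Data-Flow Scheduler's actual algorithm or the FLESnet API: the function name `assign_timeslice`, the round-robin base schedule, and the representation of failures as a set of node IDs are all assumptions made for illustration. The point it shows is that if every input node evaluates the same deterministic rule with the same set of failed nodes, all of them send the contributions for a given timeslice to the same surviving compute node, and the load of a failed node is spread over the remaining ones.

```python
def assign_timeslice(timeslice: int, nodes: list[int], failed: set[int]) -> int:
    """Return the compute node that should build `timeslice`.

    Hypothetical rule (an assumption, not FLESnet's scheduler):
    round-robin over all nodes; timeslices whose primary node has
    failed are spread round-robin over the surviving nodes.
    """
    primary = nodes[timeslice % len(nodes)]
    if primary not in failed:
        return primary

    alive = [n for n in nodes if n not in failed]
    if not alive:
        raise RuntimeError("no compute nodes left")

    # The k-th timeslice of the failed node goes to the k-th survivor
    # (modulo the number of survivors), so the extra load is balanced.
    k = timeslice // len(nodes)
    return alive[k % len(alive)]


if __name__ == "__main__":
    nodes = [0, 1, 2, 3]
    failed = {2}
    # Every input node computing this with the same `failed` set
    # obtains the same targets.
    print([assign_timeslice(ts, nodes, failed) for ts in range(12)])
```

In this sketch only the timeslices of the failed node move; timeslices already assigned to healthy nodes keep their target, which limits how much partially built data would have to be redistributed.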