Salem, F., F. Schintke, T. Schütt, and A. Reinefeld, Data-flow scheduling for a scalable FLESnet, Darmstadt, Zuse Institute Berlin, pp. 130–131, 04/2018. Abstract

FLES is the first-level event selector of the CBM project. For timeslice building, the FLESnet software uses tightly coupled processes. Unfortunately, slow processes and network congestion cause a serious performance bottleneck, especially in large systems. We therefore developed a scheduling mechanism that minimizes the synchronization overhead and enhances scalability. We highlight the main components of our new scheduling mechanism and show how it improves the performance of FLESnet in large deployments.

Salem, F., F. Schintke, T. Schütt, and A. Reinefeld, Supporting various interconnects in FLESnet using Libfabric, Darmstadt, Germany, GSI, pp. 159–160, 2017. Abstract

FLES is the first-level event selector of the CBM project. The system prototype, FLESnet, natively uses InfiniBand for communication, which limits its portability to other interconnects that may be of interest when the actual analysis cluster is built. We adapt FLESnet to Libfabric, a generic networking framework, and explain how this framework is used for timeslice building with efficient, zero-copy data transfers. We discuss preliminary benchmarking results of the new implementation using the Cray GNI interconnect with Intel Xeon Phi processors.

Salem, F., F. Schintke, T. Schütt, and A. Reinefeld, "Scheduling Data Streams for Low Latency and High Throughput on a Cray XC40 Using Libfabric", Concurrency and Computation: Practice and Experience, pp. 1–14, 2019. Abstract

Achieving efficient many-to-many communication on a given network topology is a challenging task when many data streams from different sources have to be scattered concurrently to many destinations with low variance in arrival times. In such scenarios, it is critical to saturate but not to congest the bisectional bandwidth of the network topology in order to achieve a good aggregate throughput. When there are many concurrent point-to-point connections, the communication pattern needs to be dynamically scheduled in a fine-grained manner to avoid network congestion (links, switches), overload of the nodes' incoming links, and receive buffer overflow. Motivated by the use case of the Compressed Baryonic Matter experiment (CBM), we study the performance and variance of such communication patterns on a Cray XC40 with different routing schemes and scheduling approaches. We present a distributed Data Flow Scheduler (DFS) that reduces the variance of arrival times from all sources by at least a factor of 30 and increases the achieved aggregate bandwidth by up to 50%.

Salem, F., F. Schintke, T. Schütt, and A. Reinefeld, "Scheduling data streams for low latency and high throughput on a Cray XC40 using Libfabric", CUG Conference Proceedings, Canada, 2019. Abstract

Achieving efficient many-to-many communication on a given network topology is a challenging task when many data streams from different sources have to be scattered concurrently to many destinations with low variance in arrival times. In such scenarios, it is critical to saturate but not to congest the bisectional bandwidth of the network topology in order to achieve a good aggregate throughput. When there are many concurrent point-to-point connections, the communication pattern needs to be dynamically scheduled in a fine-grained manner to avoid network congestion (links, switches), overload of the nodes' incoming links, and receive buffer overflow. Motivated by the use case of the Compressed Baryonic Matter experiment (CBM), we study the performance and variance of such communication patterns on a Cray XC40 with different routing schemes and scheduling approaches. We present a distributed Data Flow Scheduler (DFS) that reduces the variance of arrival times from all sources by at least a factor of 30 and increases the achieved aggregate bandwidth by up to 50%.

Salem, F., F. Schintke, T. Schütt, and A. Reinefeld, "Improving the throughput of a scalable FLESnet using the Data-Flow Scheduler", CBM Progress Report 2018, pp. 149–150, 2019. Abstract

Minimizing the latency is essential for FLESnet to achieve good aggregate bandwidth, and we already improved the throughput and lowered the latency with our Data-Flow Scheduler. However, there is still a gap between the achieved and the maximally achievable bandwidth. For timeslice building, we found that FLESnet performs two RDMA writes for each contribution. This congests the network unnecessarily and increases the latency, especially in large deployments. We therefore optimized FLESnet to need only one RDMA request per contribution. We show how the aggregate bandwidth increases when the scheduler is used. We also discuss the performance of the scheduler on an InfiniBand cluster.

Salem, F., F. Schintke, and A. Reinefeld, "Handling Compute-Node Failures in FLESnet", CBM Progress Report 2019, pp. 167–168, 2020. Abstract

FLESnet should tolerate failures of individual compute nodes and continue to distribute and build timeslices without overly sacrificing the achieved aggregate throughput and latency in large-scale deployments. This requires all input nodes to consistently adhere to a new schedule and distribution schema on time and might require the redistribution of not yet entirely built timeslices. We implemented such a fault-tolerance mechanism in our Data-Flow Scheduler that detects failed nodes and dynamically (re-)assigns their load to the remaining compute nodes. We show the influence of failures on the aggregate bandwidth and latency of timeslice building on our Omni-Path testbed.

Salem, F., M. Azouz, and A. Ahmed, Online Expert System, Cairo, Cairo University, 2010.

CS241: Operating Systems (Semester: Fall)

CS214: Data Structures (Semester: Spring)