Abstract
The effectiveness of parallel and distributed systems depends heavily on the reliability and efficiency of the method used for information transfer. To satisfy these requirements, the communication medium must provide fault tolerance throughout the communication layers while minimizing operational overhead. The work described herein relates to a scalable communication system for a distributed-memory parallel processing architecture constructed from message routing switches. The system employs a hardware mechanism, local to each physical connection, that provides a distributed solution for fault detection and isolation. By isolating faults and using adaptive routing algorithms, networks can be designed that remain operable in the presence of faults. An explanation of the basic switch and the fault isolation mechanism is provided. The paper concludes with implementation details of the operational hardware and of the environment in which it has been tested.