Automatic reconfiguration in Autonet

Download Automatic reconfiguration in Autonet

Preview text

Automatic Reconfiguration in Autonet
Thomas L. Rodeheffer and Michael D. Schroeder September 18, 1991 SRC Research Report 77
This report is an expanded version of a paper that has been accepted for presentation at the ACM Symposium on Operating Systems Principles, October 13-16, 1991.
© Digital Equipment Corporation 1990. This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without any payment of fee is granted for nonprofit, educational, and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of the Systems Research Center of Digital Equipment Corporation in Palo Alto, California; an acknowledgement of the authors and individual contributors to the work; and all applicable portions of the copyright notice. Copying, reproducing, or republishing for any other purpose shall require a license with payment of fee to the Systems Research Center. All rights reserved.

ABSTRACT Autonet is a switch-based local area network using 100 Mbit/s
full-duplex point-to-point links. Crossbar switches are interconnected to other switches and to host controllers in an arbitrary pattern. Switch hardware uses the destination address in each packet to determine the proper outgoing link for the next step in the path from source to destination. Autonet automatically recalculates these forwarding paths in response to failures and additions of network components. This automatic reconfiguration allows the network to continue normal operation without need of human intervention. Reconfiguration occurs quickly enough that higher-level protocols are not disrupted. This paper describes the fault monitoring and topology acquisition mechanisms that are central to automatic reconfiguration in Autonet.

1. Introduction Autonet is a switch-based local area network. In an Autonet,
12-by-12 crossbar switches are interconnected to other switches and to host controllers with 100 Mbit/s full-duplex links in an arbitrary pattern. In normal operation each packet follows a precomputed link-to-link path from source to destination. At each switch, hardware uses the destination address in each packet as the lookup index in a forwarding table to determine the proper outgoing link for the next step in the path. An earlier paper [15] provides an overview of the Autonet design. In the present paper we concentrate on automatic reconfiguration in Autonet.
Automatic operation and high availability are important objectives for Autonet. Our goal was to make Autonet look to host communications software like a fast, high-capacity Ethernet segment that never failed permanently. To provide automatic operation and high availability an Autonet automatically reconfigures itself to use the available topology of switches and links. A processor in each switch monitors the directly connected links and neighboring switches. Whenever this monitor notices a change in what is working (either additions or removals), it triggers a distributed algorithm on all switch processors that determines and distributes the new network topology to all switches. Once each switch knows the new topology, it recalculates routing information and reloads its forwarding table to permit operation with the new topology. This automatic reconfiguration is fast enough that high-level communication protocols are not permanently disrupted, even though client packets may be lost while reconfiguration is in process.
When an Autonet is installed with a redundant topology, automatic reconfiguration allows it to continue to provide full interconnection of all hosts as components fail or are removed from service. If there are so many failures that connectivity is lost, the Autonet will partition, but service will continue within each connected portion. When components are repaired, or the topology is extended with new switches or links, automatic reconfiguration incorporates the added components in the operational network.
Autonet has been the service LAN for our research center since February of 1990, with 31 switches providing service to over

100 hosts. Operational experience has allowed (forced) us to improve the sensitivity, stability, and performance of the automatic reconfiguration mechanisms. So this paper, in addition to giving a more detailed description than we have previously published on reconfiguration in Autonet, also highlights the important changes that were dictated by our experience.
The paper is organized as follows. Section 2 compares Autonet with other networks with automatic reconfiguration. Section 3 gives the overall structure of reconfiguration in Autonet. Section 4 discusses monitoring and Section 5 topology acquisition. Section 6 presents conclusions.
2. Reconfiguration in Other Networks The standard example of automatic reconfiguration in a com-
puter network is the ARPANET [10, 11]. The principle differences between ARPANET and Autonet relate to the fact that ARPANET is designed as a wide-area, moderate-speed network while Autonet is designed as a local-area, high-speed network.
The ARPANET performs store-and-forward routing based on topology descriptions maintained at each switch (IMP), and tolerates temporary forwarding loops by discarding packets if necessary. Each switch regularly broadcasts updates of the status of its local links.
The Autonet switch hardware processes packets first-comefirst-served from each link and uses cut-through to decrease the expected delay through the switch. This design was chosen because it provided the best light-load performance for the simplest hardware. However as a consequence, transient forwarding loops might result in deadlock and thus cannot be tolerated. We do not have efficient means of detecting a deadlock or of clearing one. We took the simplest approach of rapidly recalculating the entire topology whenever it changes and expunging all old forwarding tables before installing any new ones. This global-recalculation design is simpler than incremental approaches and represents an appropriate engineering tradeoff for a moderate-sized network of several dozen switches.
Another network that provides automatic topology maintenance is PARIS [2]. Like ARPANET, PARIS maintains a topology description at each switch via regular broadcasts of local link status

updates. PARIS is designed more as a fast connection network than a packet switching network. Packets travel on explicit source routes which are determined at connection setup by examining a description of the current topology. Topology changes have no effect on existing connections, except that a link failure kills all of the connections using that link. Link updates are distributed reliably and with very high bandwidth by hardware flooding over a spanning tree managed by the software. The software tree management is very careful not to introduce inconsistencies into the tree. In contrast to PARIS, Autonet routes each new packet independently and thus automatically maintains ongoing conversations by routing around link failures and exploiting link recoveries.
Bridged Ethernet [13] is another network that provides automatic reconfiguration. The principle difference from Autonet is that a bridged Ethernet supports multiple-access links with no way to distinguish forwarded packets from originals. A bridged Ethernet carefully maintains a loop-free forwarding tree so that each bridge can deduce what to do with each packet. Although the time constants required to maintain consistency in the forwarding tree are on the order of several seconds, a bridged Ethernet does eventually adapt to any topology change. In contrast, Autonet has an implicit addressing structure induced by its point-to-point links. An arriving packet is always known to be intended for the recipient, at least as an intermediate hop. We also designed a packet encapsulation using network-assigned destination addresses, in order to make forwarding easier (much like Cypress [4]). As a consequence, Autonet uses more forwarding paths and reconfigures much faster than a bridged Ethernet.
3. The Structure of the Automatic Reconfiguration Mechanism
Automatic reconfiguration in Autonet involves three main tasks: monitoring, topology acquisition, and routing. Monitoring involves watching the neighborhood of each switch to determine when the network topology changes. Topology acquisition involves collecting and distributing the description of the network topology. Routing involves recalculating the forwarding table at each switch.

Monitoring determines which links are useful for carrying client packets from one switch to another. From the point of view of reconfiguration, a link is useful if and only if it has an acceptable error rate in both directions, the nodes at each end are distinct, operational switches, and each switch knows the identity of the other. (A switch is identified by a 48-bit unique identifier stored in a ROM.) Topology acquisition and route recalculation is triggered whenever the set of useful links changes. Of course, host-to-switch links also carry client packets, but changes in the state of such links never trigger topology acquisition and route recalculations. At most, changes in the host links to a particular switch cause locally calculated changes in that switch’s forwarding table. So, from the point of view of network reconfiguration, we largely ignore such links.
Monitoring guarantees that topology acquisition (and client) packets will not travel over a link unless both switches agree the link is useful. Because there are two switches involved there will always be transient disagreement whenever the link is changing state, but the monitoring task makes the period of disagreement as brief as possible. A monitor runs independently for each link in each switch and is always active. It will trigger topology acquisition whenever its link changes from useful to not useful or vice versa. The monitors guarantee that, eventually, links remain stable in one state or the other long enough that topology acquisition and routing can finish.
Topology acquisition is responsible for discovering the network topology and delivering a description of it to every switch that is currently part of the network. This task runs in an artificial environment in which changes in link state do not occur. When two switches disagree about the state of a link, the task does not complete. The artificial environment is implemented on top of the monitoring layer by means of an epoch mechanism: any change or inconsistency triggers a new epoch corresponding to a new stable environment. Topology acquisition is a distributed computation that spreads to all switches from the one where a link monitor triggered it.
Routing, the final task of reconfiguration, uses the topology description to compute the forwarding tables for each switch. Because each switch knows the entire topology, each can calculate its own forwarding table. In this paper we are not concerned with the algorithm for constructing the forwarding tables. During topology

acquisition and routing, the switch discards client packets. Once the forwarding table has been recalculated, a switch is able to forward any client packets it receives. The reason for discarding client packets during reconfiguration is to prevent deadlock.
The remainder of the paper concentrates on monitoring and topology acquisition. In considering these topics in more detail we can model the Autonet as a collection of nodes (switches) with numbered ports. Nodes may be interconnected in an arbitrary pattern by full-duplex, port-to-port links. Each node is a computer that can send and receive packets on each attached link that works. Each node has a predetermined unique identifier. From now on we will largely ignore links to hosts, the hosts themselves, forwarding of host packets, and even the forwarding tables in the switches.
The neighborhood of a node N is the set of all useful switchto-switch links that have N as one endpoint. The neighborhood of a set of nodes is the union of the neighborhoods of the members. Autonet reconfiguration can be characterized in terms of these neighborhoods. The monitoring task on node N is responsible for knowing the current neighborhood of N, and for initiating a topology task whenever the neighborhood changes. Topology acquisition works by building a spanning tree, merging neighborhoods of larger and larger subtrees until the root has the neighborhood of the entire graph, and then flooding the topology down the spanning tree to all nodes.
We are now ready to describe monitoring and topology acquisition in more detail. In this discussion, “packet” and “signal” refer to information passing between separate nodes over a connecting link. Packets are regular data packets whereas signals are transported in link protocols below the level of packet traffic. “Message”refers to information passing between software components in a single node.
4. Monitoring The monitoring task imposes a model that allows only two
types of changes in a node’s neighborhood: link failure, which removes a connection from the network topology, and link recovery, which adds a connection. All changes in network interconnection result in some combination of these two types of neighborhood

change. For example, if a technician powers off a switch, all of the adjacent nodes see link failures on their links to the dead node.
The monitoring task responds rapidly to link failures and less rapidly to link recoveries. Link failure, especially abrupt failure, must be detected and reported quickly, because failure can disrupt ongoing client communication. It is not so urgent to rush back into service a link that recently gave problems. Although it is true that link recovery might heal a network partition or increase network capacity, repairing or adding a link usually takes quite a bit of time, so clients usually will not notice a small additional delay. Many networks, for example, the ARPANET, have a delay before placing links back in service [7]. By delaying longer as a link proves its unreliability, we achieve network stability despite intermittent failures.
A useful link is one that allows bidirectional packet transfer with acceptably low error rates between two distinct nodes. The only way this condition can be verified, of course, is for the two nodes to periodically exchange packets, and this is what the monitoring task does. This strategy has the advantage that it is an end-to-end check [14]. It has the disadvantage that failure detection may not be very prompt, because it depends on a timeout whose minimum value is bounded by processing overhead. In order to give prompt detection of expected modes of link failure, the monitoring task also treats certain kinds of hardware errors as indicating failure. For example, if more than three link coding violations are detected in a ten-millisecond interval, the monitoring task immediately assumes that the link has failed.
The monitoring task is organized as two layers: a transmission layer and a connectivity layer. The transmission layer deals with proper transmission and reception of data on the link as seen by the hardware. It makes sure that problems on this link do not interfere with other links and it responds promptly to expected modes of link failure. The connectivity layer, which rests on top of the transmission layer, deals with exchanging packets with the remote node and agreeing on the state of the link. Both of these layers make use of a method for defending against intermittent operation called the skeptic. We describe the skeptic and then the two layers of the monitoring task in detail.

4.1. The Skeptic The skeptic limits the failure rate of a link by delaying its re-
covery if it has a bad history. Without the skeptic, an intermittent link could cause an unlimited amount of disruption: we are especially concerned with limiting the frequency of reconfigurations. There are four requirements in the design of the skeptic: 1) A link with a good history must be allowed to fail and recover several times without significant penalty. 2) In the worst case, a link’s average long-term failure rate must not be allowed to exceed some low rate. 3) Common behaviors shown by bad links should result in exceedingly low average long-term failure rates. 4) A link that stops being bad must eventually be forgiven its bad history.
Requirement 3 distinguishes the skeptic from typical fault isolation and forgiveness methods such as the autorestart mechanism in Hydra [17]. The typical method to meet requirements 1, 2, and 4 sets a quota of say, ten failures per hour, and refuses to recover any link that is over quota. We have observed a common pattern of intermittent behavior in which a link fails again soon after being recovered, in spite of its passing all diagnostics performed in the interim. With the quota method, this pattern would produce a longterm average failure rate of ten failures per hour. This kind of error pattern may not be uncommon, for example Lin and Siewiorek observed a clustering pattern of transient errors in the VICE file system [9].
The skeptic can be used with any object whose status may change intermittently: it provides a “filtered object” whose rate of status change is limited. As seen by the skeptic, an object is an abstraction that emits a series of messages, each of which says either “working” or “broken”. The skeptic in turn sends out a filtered version of these messages to the next higher level of abstraction. This operation is shown in Figure 1.
4.1.1. Details of the skeptic The skeptic is a state machine with auxiliary variables, timers,
and policy parameters, as is shown in Figure 2. Dead state means that the subordinate object is broken, wait state means that the object is working but the skeptic is delaying for a while before passing on

that information, and good state means that the object is working and the skeptic has concurred.

subordinate object

“working” “broken”


“working” “broken”

filtered object
Figure 1: The concept of the skeptic.


recv “broken” send “broken”
+1 level


–1 level

skepticism level

recv “broken”

recv “working”


wait timer expires send “working”

Figure 2: The internals of the skeptic.
Three of the state transitions are caused by messages from the subordinate object. When the skeptic is in wait state or good state and it receives a “broken” message, it moves to dead state. When the skeptic is in dead state and it receives a “working” message, it moves to wait state. Otherwise the messages have no effect. The only other transition in the state machine happens when the wait timer expires; in this case the skeptic moves from wait state to good state.


Preparing to load PDF file. please wait...

0 of 0
Automatic reconfiguration in Autonet