This form of synchronization is virtual because the actual situation is more complex than seems to be the case from a programmer's perspective: the ordering property (a so-called synchronous event notification model) can sometimes be violated. For example, the messaging system might sometimes report the same events in different orders at different processes, but only in situations where the processes won't notice. If they are replicating data, the most common use of virtual synchrony, the replicas remain in identical states. The flexibility associated with this limited form of event reordering permits virtual synchrony platforms to achieve extremely high data rates while still preserving very strong fault-tolerance and consistency guarantees.
Each process group has a name, visible within the network. A single application program can connect itself to many groups at the same time. In effect, a process group becomes an abstraction for sharing data, coordinating actions, and monitoring other processes.
The term virtual synchrony refers to the fact that applications see the shared data evolve in what seems to be a synchronous manner. This form of synchronization is virtual because the actual situation is more complex than seems to be the case from a programmer's perspective. Like a compiler that sometimes reorders the execution of instructions for higher performance, or an operating system that sometimes stores random access memory on disk, virtual synchrony sometimes reorders events in ways that improve performance, and yet won't be noticeable to applications.
Using the virtual synchrony model, it is relatively easy to maintain fault-tolerant replicated data in a consistent state. One can then build all sorts of higher level abstractions over the basic replication mechanisms.
Virtual synchrony replication is used mostly when applications are replicating information that evolves extremely rapidly. As discussed further below, the kinds of applications that would need this model include multiuser role-playing games, air traffic control systems, stock exchanges, and telecommunication switches. Of course, there are other ways to solve the same problems. For example, most of today's online multiuser role-playing games give users a sense that they are sharing replicated data, but in fact the data lives in a server on a data center, and any information passes through the data centers. Those games probably wouldn't use models like virtual synchrony, at present. However, as they push towards higher and higher data rates, taking the server out of the critical performance path becomes important, and with this step, models such as virtual synchrony become valuable.
The trend towards cloud computing has increased interest in consistent state replication. Cloud computing systems are large virtualized data centers, operated by internet search or commerce firms such as IBM, Microsoft, Google and Amazon. Inside a cloud computing platform one finds services such as lock management systems (Google's is called Chubby, and Yahoo! uses one called Zookeeper), and these are implemented using virtual synchrony or closely related models. Other services that might be implemented using virtual synchrony include the cluster management tools that relaunch failed applications when nodes in a cluster crash, event notification tools that inform applications when significant events occur, and logging tools that help an application save its state for replay during recovery.
Virtual Synchrony is a popular computing model, closely related to the transactional one-copy serializability model (used mostly in replicated database systems) and the state machine (consensus) model, sometimes known as "Paxos", the name given to the most widely cited state-machine implementation.
The basic goal for all of these protocols is to replicate data in a distributed system in a manner that makes the replicated entity indistinguishable from a non-replicated object implementing the same interface. For example, if we imagine a simple variable, x, that can be read or written to, a replicated version might consist of some set of replicas x0, x1, ... xn and an associated protocol, such that reads and writes to the replicates are performed in a way that looks indistinguishable from reads and writes to the original variable. The challenge is to deal with cases in which multiple updates are initiated concurrently, or where a failure disrupts an update while it is still in progress (the former problem is sometimes called an edit conflict). When we create a process group, the idea is that each of its members will hold a replica. Updates are delivered to the group members through an event notification interface implemented in a way that eliminates these kinds of problems.
The central difference between the three models is that virtual synchrony assumes that the variable is replicated in memory by a set of processes executing on some collection of machines in a network. Transactional one-copy serializability assumes that the data resides in a collection of transactional databases (on disk), and implements the full transactional ACID properties, with the usual begin/commit/abort interface. State-machine consensus lies somewhere in the middle: the variables are assumed to be persistent (for example they might be stored in files), but are not assumed to have full ACID properties, and access is not assumed to go through a transactional begin/commit/abort interface.
None of the three models is particularly difficult to solve in a system where the set of participating processes is stable, and where messages are delivered reliably. However, in real networks, computers can crash or become disconnected and messages can be lost. The need to preserve the properties of the model while masking failures and maintaining high performance is what makes the data replication problem so difficult.
All three models assume that machines may fail by crashing: a computer halts, or some process on it halts, and other processes sense the failure by timeout. Timeout, of course, is a potentially inaccurate way to sense failures (timeouts always discover true crashes, but sometimes a timeout will trigger for some other reason, such as a transient connectivity problem.) A platform implementing any of these models must provide the programmer with a set of system calls that allows him or her to write code that will continue to respect the model even if these kinds of problems occur. In effect, the platform hides this difficult fault-tolerance problem from the programmer.
None of the three models can handle more complex failures, such as machines that are taken over by a virus, or a network that sometimes modifies the messages transmitted. The so-called Byzantine agreement model goes beyond the data replication schemes discussed here by also solving such issues, but does so at a price: Byzantine replication protocols typically require larger numbers of servers, and can be much slower.
Virtual synchrony is useful for more than just replicating data, although replication is probably the most common use. Other mechanisms that can be constructed "over" a virtual synchrony platform include:
Among the three models, virtual synchrony achieves the highest levels of performance, but this comes at a cost: virtual synchrony's fault-tolerance is weaker than other models.
The Paxos and transactional models guarantee a higher degree of durability in the presence of crashes. Both models need to first ensure that an update is recorded in a write-ahead log before any process can actually perform the update. This introduces a form of two-phase commit into the protocol, and hence slows things down: first the update is sent and logged, and all members confirm that they have it, and only then is it performed. In contrast, virtual synchrony implementations with in-memory data replication can generally update a replicated variable as soon as a message describing the update reaches the relevant group members. They can stream high rates of updates by packing multiple updates into a single message.
To give some sense of the relative speed, experiments with 4-node replicated variables undertaken on the Isis and Horus systems in the 1980s suggested that virtual synchrony implementations in typical networks were about 100 times faster than state-machine replication using Paxos, and about 1000 to 10,000 times faster than full-fledged transactional one-copy-serializability. Of course, these sorts of order of magnitude numbers are highly sensitive to the implementation and choice of platform, but they also reflect underlying obligations within the protocols used to implement the models. Modern systems like Spread and Quicksilver can achieve data rates of 10,000 multicasts per second or more, and can scale to large networks with huge numbers of groups or processes.
Most distributed computing platforms support one or more of these models. For example, the widely supported object-oriented CORBA (see also [www.omg.org/gettingstarted/corbafaq.htm OMG] platforms all support transactions and some CORBA products support transactional replication in the one-copy-serializability model. The "CORBA Fault Tolerant Objects standard" is based on the virtual synchrony model. Virtual synchrony was also used in developing the New York Stock Exchange fault-tolerance architecture, the French Air Traffic Control System, the US Navy AEGIS system, IBM's Business Process replication architecture for WebSphere and Microsoft's Windows Clustering architecture for Windows Longhorn enterprise servers.
Virtual synchrony is usually presented to programmers through a simple distributed programming library that supports at least three basic interfaces. First, a process (an executing program) can join a process group. Each group has a name (a bit like a file name, although these names are interpreted within a network, not relative to a disk), and each has a list of members. The join primitive returns some form of handle on the group. The process can then register a handler for incoming events, and can send multicasts to the group.
The basic guarantee associated with the model is that all processes belonging to a group see the same events, in the same order. The platform senses failures (using timeouts) but reports them in a consistent manner to all group members. Multicast messages may be initiated concurrently by multiple senders, but will be delivered in some fixed order selected by the protocols implementing the model.
Notice that the guarantee just described embodies what may seem to be a contradiction. We know that timeout cannot be used to detect failures accurately. Yet virtual synchrony lets the user treat failure notifications (view changes) as trustworthy, infallible events. The key insight is that virtual synchrony is implemented by a software system that creates an abstraction within which the user's code is executed. Thus, failure detection by the platform (using timeouts) triggers an internal agreement protocol (within the platform). Only when this protocol terminates is a fault event delivered to the application. The application is spared from needing to implement the agreement mechanism, and sees a simple and seemingly accurate fault notification event.
Events are of several types. First, each received multicast is delivered as an event. But membership changes in the group are also reported through events; these are called new views of the group. Moreover, when a process joins a group, some existing member is asked to create a checkpoint: a message describing the state of the group at the time the process joined. This is reported to the new member as a state transfer event, and is used to initialize the joining process.
For example, suppose that an air traffic control system maintains some group associated with the airplanes flying in sector XYZ over Paris. Each air traffic controller who monitors that sector would have a process running on his or her machine, and these processes would join the XYZ group as they start up. The members would replicate the list of air traffic control plans and tracks associated with sector XYZ. Upon joining, a process would obtain a copy of the state of the sector as of the moment it joined, delivered as a checkpoint through a state transfer event. Loading such a checkpoint is analogous to reading a file that lists the current state of the sector. Later, as events occur that impact the status of the sector, they would be multicast so that all members of the group can see those events. Since each member is in the same state, and receives the same updates, each air traffic controller sees the same sector status and they see it evolve in the same manner. If a failure occurs, the surviving systems can take over roles that were previously held by the crashed one.
The three executions shown above illustrate the type of event reordering used in virtual synchrony systems. Each shows a set of processes (named p, q, etc) executing as time elapses, from left to right. They interact by exchanging messages, which are shown as arrows from process to process. Notice that the three figures are quite similar but differ in seemingly small ways: in the first figure, the message-passing arrows are vertical, as if the sending of a message was an instantaneous event. In the second figure, the sending of a message takes "time", and we see this because the arrows are now slanted forward. In the third figure, some of the message sending arrows cross one-another.
We will start by looking closely at figure 1 (you may wish to enlarge it so that you can see the arrows clearly). Consider the sequence of events that occur as time elapses, from left to right.
At the start, p creates a process group and is its only member. Then q joins and with p's help, initializes itself. The heavy arrow denotes the creation of a checkpoint by p, which is copied to q, and then loaded by q. Perhaps this group lists air traffic control state for some sector over Paris. Now t, a non-member, asks the group some question. It sends a message, and the group members cooperate to respond (perhaps they each search half of an ATC database -- after all, each knows that the group has two members and each knows its own rank, so parallel computing becomes easy! Next we see some update messages -- multicasts -- exchanged by p and q. Process r joins the group, but q either crashes or fails. Notice that each event is seen in the identical order by all the members. This makes it easy to track the evolving group state. Some would call this a state machine execution.
What makes a virtually synchronous system virtual rather than real is that the synchronous event ordering is actually an illusion. If the timing in the system isn't perfectly synchronized, messages may be delivered with some delays (Figure 2). Now, the delays in question may not be significant to human perception. But how can the application know what order to process the events in? We can't use true clock time for this: even with GPS clocks, synchronization won't be better than a few milliseconds.
In a worst case scenario, events genuinely happen out of order (Figure 3). The point this figure is intended to make is that sometimes, a system can deliver events out of order -- and yet the application might not notice. We'll discuss cases in which this occurs momentarily. By deviating from the synchronous order, virtual synchrony systems gain speed and improve fault-tolerance (they are less likely to experience correlated crashes where some message causes all the members to crash simultaneously).
In virtual synchrony systems, the application programmer signals to the platform what form of ordering is really needed. For example, the programmer might indicate that multicast m updates different data than multicast n. Virtual synchrony software systems make it easy to do this sort of thing, although we won't delve into the details here. Basically, the programmer says "you can deliver messages m and n in any order you like, because my application won't notice". When permitted to do so, the communication system can now save time by not delaying messages under conditions where providing identical delivery order for n and m would have introduced extra cost and thereby slowed down the data rate.
When could we get away with this sort of thing? The answer usually depends on the application. But one good example arises if a group is maintaining data about some collection of objects that tend to be accessed independently. For example, perhaps the group represents the rooms in a multi-user role-playing game. Users are only in one room at a time, hence multicasts that update data in different rooms can be delivered in different orders. If a user sees one such multicast (e.g. that user happens to be in Sarah's Ice Cream shop when the a message is delivered that causes the telephone to ring), they won't see the other one (because it affected the state of some other room). Returning to our air traffic control example, different groups might represent different sectors of the sky, at which point the same kinds of options arise. A programmer designing such an application will often have simple ways to realize that this is the case, and can then signal this through an appropriate system-call.
Why bother? The key question relates to the speed of the application: a communication system gains performance as its ordering obligations are relaxed. Thus, virtual synchrony is motivated by a performance objective. The system seeks to be as fast as an unreliable udp multicast and yet to have strong fault-tolerance and ordering guarantees, similar to those of Paxos.
We mentioned that there is a sense in which virtual synchrony is a weaker model than transactional one-copy serializability or state machine consensus in the style of Paxos. Partly this relates to ordering: virtual synchrony often weakens the message delivery ordering to gain performance. As mentioned above, doing so can sometimes increase robustness too. If different copies sometimes process events in different orders (doing so only when this won't have any impact on the ultimate state of the object), the copies may still be somewhat more robust against messages that cause exceptions. After all, many bugs are exquisitely sensitive to the exact sequence of events that a process experiences, so processes that see the same things but in different orders can often survive problems that would be fatal in some specific ordering.
But the other sense in which virtual synchrony is a weaker model relates to exactly what happens when some process crashes. Suppose that process p sends a multicast to a group G, and then p and some member of the group, say q, both crash. No process that remains operational has a copy of the multicast. What should the platform do?
In virtual synchrony, the group continues executing as if no message was ever sent. After all, there is no evidence to the contrary. P and q have both crashed, so they won't behave in a manner inconsistent with the model. Yet it is possible that q received p's message and delivered it to the application right before the crash. So there is a case in which virtual synchrony seems to lie: it behaves as if no message was sent, and yet the crashed processes might actually have exchanged a message.
This never happens in Paxos or transactional systems, which makes them a good match for updating database files on a disk. In both systems, if q later recovers and rejoins the group, any data it collected prior to crashing will still be valid, except to the extent that it missed updates delivered to the other group members while it was down. The cost of this guarantee is, however, quite high. Asynchronous Paxos, and transactional systems, impose a long delay before any process can deliver a message. First, these platforms make sure that the message reaches all of its destinations, asking them to delay the incoming message before delivering it. Only after the first step is completed are recipients told that it is safe to deliver the message to the application. (In one variant on these models, the platform only makes sure that a quorum (a majority) receive the message, but the delay is comparable).
The delay associated with this extra round of communication can have a big impact on performance.
Experience with virtual synchrony shows that for most applications, the weak but fast form of delivery is just fine. For rare cases where stronger guarantees are needed, the application programmer can request that a slower delivery be performed, paying an infrequent higher price, but only when necessary. The resulting performance will be much higher than if the slower, more conservative delivery property was used for every message.
Virtual synchrony has been supported by the "Isis Toolkit", the "Horus system", the Transis system, the Totem system, an IBM system called Phoenix, a distributed security key management system called Rampart, the "Ensemble system", the Quicksilver system, and a number of products (including the IBM and Microsoft ones mentioned earlier). At the time of this writing, virtual synchrony toolkits that programmers can use to implement new virtually synchronous applications include the Spread Toolkit, the C-Ensemble system, Appia and Quicksilver.
1. Reliable Distributed Systems: Technologies, Web Services and Applications. K.P. Birman. Springer Verlag (1997). Textbook, covers a broad spectrum of distributed computing concepts, including virtual synchrony.
2. Distributed Systems: Principles and Paradigms (2nd Edition). Andrew S. Tanenbaum, Maarten van Steen (2002). Textbook, covers a broad spectrum of distributed computing concepts, including virtual synchrony.
3. "The process group approach to reliable distributed computing" K.P. Birman, Communications of the ACM (CACM) 16:12 (Dec. 1993). Written for non-experts.
4. "Group communication specifications: a comprehensive study" Gregory V. Chockler, Idit Keidar, Roman Vitenberg. ACM Computing Surveys 33:4 (2001). Introduces a mathematical formalism for these kinds of models, then uses it to compare their expressive power and their failure detection assumptions.
5. "Practical Impact of Group Communication Theory." Andre Schiper. Future Directions in Distributed Computing. Springer Verlag Lecture Notes in Computer Science 2584 (July 2005). A history of the area, assumes familiarity with the general topic.
6. "The part-time parliament" Leslie Lamport. ACM Transactions on Computing Systems (TOCS), 16:2 (1998). Introduces the Paxos implementation of replicated state machines. 7. "A review of experiences with reliable multicast" K. P. Birman. Software, Practice and Experience. 29:9 (July 1999). Includes discussion of the New York and Swiss Stock Exchange, French Air Traffic Control System and several other projects that used virtual synchrony as part of a system that was ultimately deployed (in fact with just a few exceptions, these systems are still heavily used).
8. "Exploiting virtual synchrony in distributed systems" K.P. Birman and T. Joseph. Proceedings of the 11th ACM Symposium on Operating systems principles (SOSP), Austin Texas, Nov. 1987. Earliest use of the term, but probably not the best exposition of the topic.