Fault Tolerant Virtual Machines

Dan Allford
6 min read · Dec 20, 2020

A summary of a paper (linked at the bottom) describing the initial implementation of the VMware vSphere FT product, which provides fault tolerance to virtual machines.

Disaster recovery is a key part of application architecture. In the case of a machine failure, we would like something better than complete loss of the application. A common modern design principle is to use stateless applications sitting over reliable, replicated data storage (e.g. AWS S3 or HDFS). However, there are numerous reasons this may not be possible (e.g. legacy applications). Given that legacy applications run the world, we would like to improve the failure behaviour of these stateful applications. Often this is done through regular snapshotting of VMs, although snapshots will always be somewhat out of date, and restoring from one takes time.

By using a “primary-secondary” (active-passive) set-up for virtual machines, the authors enable failover from the primary to the secondary without service interruption or application modification.

With this system, a user of the virtual machine should not be able to detect (aside from increased latency) that the hardware on which the VM was running has crashed.

The resulting system is benchmarked to incur less than a 10% performance penalty compared to a single, non-fault-tolerant VM.

The paper only deals with single-processor systems; although the ideas could be extended to multiprocessor systems, a naive extension would incur massive performance penalties (due to the need to serialize concurrent memory accesses).

One possible replication scheme would be to replicate every change to the state of the VM. This, however, has a significant limitation: it uses substantial bandwidth.

The paper describes what it calls “Deterministic Replay”, which uses a logging channel to send a serialized log of operations that the secondary should perform to keep its state consistent with the primary's. This is referred to as the state machine approach, because it treats the VM as a state machine on which we perform operations, rather than introspecting its internal state. The log is generated and shared by the hypervisor, so it requires no changes to the VM: any application run in the guest OS becomes fault tolerant. The logging channel is shared via the network.

The logging channel is implemented at the hypervisor level, and sends data from the primary to the secondary.

This log must contain sufficient detail to prevent race conditions which could lead to the states diverging. For example, if the primary reads the system time, the log must contain the value that was read so that the secondary can stay in sync. Similarly, interrupts must be delivered at the exact same point in the instruction stream, and the information to do this must be included on the logging channel.
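As a rough illustration of what such a record might hold (the field names here are my own, not the paper's), a log entry needs at least a sequence number, the instruction number at which to apply it, and the non-deterministic value itself:

```python
from dataclasses import dataclass
from enum import Enum, auto

class EventKind(Enum):
    TIME_READ = auto()   # result of a non-deterministic read (e.g. system clock)
    INTERRUPT = auto()   # an external interrupt to deliver
    DISK_READ = auto()   # data returned by a disk read (default configuration)

@dataclass
class LogEntry:
    """One record on the hypervisor's logging channel (hypothetical layout)."""
    seq: int             # sequence number, so the secondary can order entries
    instruction_no: int  # exact point in the instruction stream to apply it
    kind: EventKind
    payload: bytes       # e.g. the clock value read, or the disk read's data
```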

In the case of a failover, the secondary VM will ensure all events it received from the primary have been replayed before it is promoted to the new primary.

The paper discusses reading from disk; in the default implementation, all data read from disk is also included on the channel. For disk-intensive applications this could lead to the logging channel becoming congested, but it simplifies the implementation by removing numerous edge cases, for example a read succeeding on the primary but failing on the secondary.


The secondary VM is transparent to clients. In the case of failover, automation will update the network to route traffic to the secondary; clients shouldn't be aware of the failover, except for possibly increased latency. In fact, one of the significant ideas of the paper is what they call “The Output Rule”.

This rule states that in the case of a failover, the backup should continue executing in a way consistent with the outputs of the primary. That is, the goal to strive for is not internal state consistency, but that the interaction with the external world is consistent with the execution of a single, non-failing machine.

In fact this goal is not quite achieved. If a failover occurs in close proximity to a side effect, the secondary may be unsure whether the side effect has actually been performed. The authors address this by noting that all network operations should already be built to handle the unreliability of the network, so sending a packet twice should be of little consequence (it is handled at the protocol layer). In a data center, essentially all of a VM's output operations are the sending of packets. If the VM were connected to other output devices, the application would need to handle these at-least-once semantics itself.

The output operation is performed twice because confirmation that it had occurred never reached the secondary.

Though it seems to change very little, the Output Rule allows the primary and secondary VMs to become decoupled in their execution. Log entries only need to be acknowledged before a side effect is performed; intermediate operations don't need to be performed in lockstep. The primary can execute ahead of the secondary, it simply must not perform a side effect until the log entries that side effect depends on have been acknowledged by the secondary.

If the primary fails after performing a side effect but before the secondary learns that it was performed, the backup, when it takes over, should assume the output operation never occurred on the primary (and resend the packet).
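A minimal sketch of how a primary might enforce the Output Rule, assuming a hypothetical `on_ack` callback from the secondary (the names are illustrative, not the paper's actual implementation):

```python
from collections import deque

def send_packet(packet):
    """Stand-in for actually putting a packet on the wire."""
    print(f"sending {packet!r}")

class Primary:
    """Sketch of the Output Rule: execution runs ahead freely, but each
    output is held until the log entries it depends on are acknowledged."""

    def __init__(self):
        self.last_acked_seq = 0  # highest log seq acknowledged by the secondary
        self.pending = deque()   # (required_seq, packet): outputs held back

    def on_ack(self, seq):
        """Called when the secondary acknowledges log entries up to `seq`."""
        self.last_acked_seq = max(self.last_acked_seq, seq)
        self._flush()

    def emit_output(self, packet, required_seq):
        """The guest wants to send `packet`, which depends on log entries up
        to `required_seq`. Only the externally visible output is delayed."""
        self.pending.append((required_seq, packet))
        self._flush()

    def _flush(self):
        while self.pending and self.pending[0][0] <= self.last_acked_seq:
            _, packet = self.pending.popleft()
            send_packet(packet)  # safe: the secondary can reproduce this state
```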

The Primary and Secondary VMs run on a shared disk

In the default configuration described in the paper, the primary and secondary VMs run on a shared (presumably network-accessible) disk. In normal operation, the primary is the only one to interact with the disk, while the secondary keeps its state up to date simply by tailing the logging channel. In the case of failure of the primary, however, the secondary will start to read from and write to the disk.

Failures are detected using regular heartbeats between the primary and secondary. The shared disk allows avoidance of a “split brain” scenario, where due to a network partition both VMs think the other is offline and “go live”, which would lead to a violation of the Output Rule. The way this is avoided is that a VM wanting to go live must first acquire a lock on the shared disk (via an atomic test-and-set); if it cannot acquire this lock, the other VM must have acquired it and be running.
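The paper's test-and-set happens on shared storage. As a loose analogue only, an exclusive-create on a shared filesystem gives the same one-winner semantics (the path and names here are hypothetical):

```python
import os

LOCK_PATH = "/shared-disk/ft.lock"  # hypothetical path on the shared disk

def try_go_live(vm_name: str) -> bool:
    """Atomically test-and-set a lock; at most one VM can win."""
    try:
        # O_CREAT | O_EXCL fails atomically if the file already exists.
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # the other VM already went live; halt instead
    with os.fdopen(fd, "w") as f:
        f.write(vm_name)
    return True
```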

Note that even if the secondary goes live while the primary is still operating, only the secondary will perform any output operations, because the primary will no longer receive acknowledgments from the secondary; from the perspective of the Output Rule, the primary might as well be completely offline.

Obviously using a shared disk does mean that both VMs will be taken out together should the disk fail.

The paper also discusses the possibility of running the VMs with separate disks. The only two changes in this scenario are that the secondary must apply disk writes rather than ignoring them, and that an alternative mechanism is needed to decide when to go live, potentially some form of consensus service (e.g. ZooKeeper).

In the scheme described above, the network could drop messages on the logging channel, or the secondary could execute faster than the primary. Either could result in the secondary receiving an interrupt that should have been delivered at an instruction number it has already executed past.

I assume that the logging channel includes a sequence number, so that the secondary can impose a strict order on the messages it receives (and hence detect dropped messages).

To prevent the secondary from executing past an interrupt, the secondary will only execute as far as the latest log event it has received.
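Putting the last three paragraphs together, the secondary's replay loop might look something like the following sketch, reusing the hypothetical LogEntry fields from the earlier example:

```python
def execute_until(instruction_no):
    """Placeholder: run the guest VM up to this instruction number."""

def apply_event(entry):
    """Placeholder: deliver the logged event (interrupt, clock value, ...)."""

class Secondary:
    """Sketch: replay never runs past the latest entry received, so no
    event can arrive 'in the past'. Illustrative names only."""

    def __init__(self):
        self.log = []         # entries from the primary, in sequence order
        self.next_index = 0   # next entry to replay

    def on_log_entry(self, entry):
        # Sequence numbers impose a strict order and expose dropped messages.
        expected = self.log[-1].seq + 1 if self.log else 1
        if entry.seq != expected:
            raise RuntimeError("dropped or reordered log message")
        self.log.append(entry)

    def step(self):
        # Execute guest instructions only as far as the received log allows;
        # each entry pins its event to an exact instruction number.
        while self.next_index < len(self.log):
            entry = self.log[self.next_index]
            execute_until(entry.instruction_no)
            apply_event(entry)
            self.next_index += 1
```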

Key Takeaways

This paper offers a performant way to add fault tolerance to any single-processor application, without needing to rearchitect it.

Replication can be achieved by treating VMs as a deterministic black box (with some work to ensure determinism of certain operations, e.g. reading from the system clock).

Operations which contain non-determinism can be made repeatable by recording their results in the log.

As long as it isn't performing side effects, the primary can execute ahead of the secondary. Side effects should be acknowledged by the secondary before being performed by the primary. Since the side effects are just packets sent over the network, duplicates are already tolerated by the protocol layer.

You can find the original paper at https://pdos.csail.mit.edu/6.824/papers/vm-ft.pdf
