Connection Synchronisation (TCP Fail-Over)

Horms (Simon Horman) - horms@valinux.co.jp
VA Linux Systems Japan, K.K. - www.valinux.co.jp
with assistance from
NTT Comware Corporation - www.nttcom.co.jp

November 2002, Revised November 2003

Abstract

Load Balancing using Layer 4 Switching is a very popular method of creating large Internet sites using commodity hardware. To avoid having a single point of failure two directors are often deployed in an active/stand-by configuration to load balance a site. However, when fail-over occurs, active connections will break unless they are synchronised between the two directors. This paper examines how the current connection synchronisation code works in the Linux Virtual Server and looks at improvements that have been made to this code-base.

Motivation

Load Balancing using Layer 4 Switching, as provided by The Linux Virtual Server Project (LVS)[1], F5 Networks[2][3], Foundry Networks[4][5] and others, is a powerful tool that allows networked services to scale beyond a single machine[6]. This is the prevailing technology for building large web sites from small, commodity servers. Many sites use this technology in one form or another. Examples include Slashdot.Org[7] (Foundry), Internet Initiative Japan (IIJ)[8] (F5) and Porsche.Com[9] (LVS).

Layer 4 Switching can, however, introduce a single point of failure into a system - all traffic must go through the linux director when it enters the network, and often when it leaves the network. Thus if the linux director fails then the load balanced site becomes unavailable. To alleviate this problem it is common to have two linux directors in an active/stand-by configuration.

Figure 1: One Director, Single Point of Failure


Figure 2: Active/Stand-By Directors

In this scenario the active linux director accepts packets for the service and forwards them to the real servers. The stand-by linux director is idle. If the active linux director fails or is taken off-line for maintenance, then the stand-by becomes the active linux director. However, when such a fail-over occurs, any existing connections will be broken.

Connection Synchronisation resolves this problem by synchronising information about active connections to the stand-by linux director. Thus, when a fail-over occurs, the new active linux director is able to load balance subsequent packets for these connections.

This project allows connections to a virtual service to continue so long as one linux director is online and can be made the active linux director. The design is such that regardless of which linux director is active or how many times a fail-over occurs, the connection is able to continue. While the focus of development has been on two linux directors in an active/stand-by configuration, the design will work with any number of linux directors. It also allows for the possibility that more than one linux director may be active[10].

LVS Overview

When a linux director receives a packet for a new connection it allocates the connection to a real server. This allocation is recorded by allocating an ip_vs_conn structure in the kernel, which stores the source address and port of the connection, the address and port of the virtual service, and the address and port of the real server. Each time a subsequent packet for this connection is received, this structure is looked up and the packet is forwarded accordingly. The structure is also used to reverse the translation process which occurs when NAT (Network Address Translation) is used to forward a packet, and to store persistence information. Persistence is a feature of LVS whereby subsequent connections from the same end-user are forwarded to the same real server.
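
The sketch below models the kind of information such an entry holds. It is a simplified illustration rather than the actual ip_vs_conn definition from the LVS source, so the field names and types are assumptions.

/* Simplified model of a connection entry. The real ip_vs_conn
 * structure also carries timers, flags, reference counts and other
 * state not shown here. */
#include <stdint.h>

struct conn_entry_sketch {
        uint16_t protocol;      /* TCP or UDP */

        uint32_t caddr;         /* client (end-user) address */
        uint16_t cport;         /* client port */

        uint32_t vaddr;         /* virtual service address */
        uint16_t vport;         /* virtual service port */

        uint32_t daddr;         /* real server address */
        uint16_t dport;         /* real server port */

        uint16_t state;         /* connection state, used for time-outs */
        uint32_t timeout;       /* remaining lifetime of the entry */
};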

When fail-over occurs, the new active linux director does not have the ip_vs_conn structures for the active connections. So when a packet is received for one of these connections, the linux director does not know which real server to send it to. Thus, the connection breaks and the end-user needs to reconnect. By synchronising the ip_vs_conn structures between the linux directors this situation can be avoided, and connections can continue after a linux director fails over.

Master/Slave Problem

The existing LVS connection synchronisation code relies on a sync-master/sync-slave setup where the sync-master sends synchronisation information and the sync-slaves listen. This means that if a sync-slave linux director becomes the active linux director, then connections made while it is active will not be synchronised. This is illustrated by the following scenario.

There are two linux directors, Director-A and Director-B. Director-A is the LVS sync-master and the active linux director. Director-B is an LVS sync-slave and the stand-by linux director.

1. An end-user opens connection-1. Director-A receives this connection, forwards it to a real server and synchronises it to the sync-slave, Director-B.

2. A fail-over occurs and Director-B becomes the active linux director. Connection-1 is able to continue because of the connection synchronisation that took place in step 1.

3. An end-user opens connection-2. Director-B receives this connection and forwards it to a real server. Connection synchronisation does not take place because Director-B is a sync-slave.

4. Another fail-over takes place and Director-A is once again the active linux director. Connection-2 is unable to continue because it was not synchronised.

Patches are available for LVS which allow a linux director to run as a sync-master and sync-slave simultaneously. This overcomes the above problem. The patches have been integrated into the LVS code for the Linux 2.6 kernel since LVS-1.1.6. They have not been integrated into the LVS code for the Linux 2.4 kernel as of LVS-1.0.10, but the patches can be found on the lvs-users mailing list[1].

As a different approach to this problem, a peer-to-peer synchronisation system has been developed, in which each linux director sends synchronisation information for the connections that it is handling to all other linux directors. Thus, in the scenario above, connections would be synchronised from Director-B to Director-A, and connection-2 would be able to continue after the second fail-over, when Director-A becomes the active linux director again.

Protocol Deficiencies

LVS's existing connection synchronisation code uses a very simple packet structure and protocol to transmit synchronisation information. The packet starts with a 4 byte header followed by up to 50 connection entries. The connection entries may be 24 or 48 bytes long. The author has observed that in practice, most synchronisation connection entries are 24 bytes, not 48 bytes. Thus, the 50 connection entry limit results in most packets having 1204 bytes of data.
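
For reference, the layout is roughly as sketched below. This is a reconstruction from the sizes given above rather than a copy of the LVS source, so the field names and exact ordering should be treated as illustrative.

/* Approximate layout of an existing synchronisation packet: a 4-byte
 * header followed by up to 50 connection entries of 24 bytes each
 * (48 bytes when TCP sequence-number options are included). */
#include <stdint.h>

struct sync_mesg_header_sketch {        /* 4 bytes */
        uint8_t  nr_conns;              /* number of entries that follow */
        uint8_t  syncid;                /* synchronisation instance id */
        uint16_t size;                  /* total size of the message */
};

struct sync_conn_entry_sketch {         /* 24 bytes */
        uint8_t  reserved;
        uint8_t  protocol;
        uint16_t cport, vport, dport;   /* client, virtual and real server ports */
        uint32_t caddr, vaddr, daddr;   /* client, virtual and real server addresses */
        uint16_t flags;
        uint16_t state;
        /* optionally followed by 24 bytes of TCP sequence-number data,
         * giving the 48-byte form */
};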

There are several aspects of this design that can be improved without affecting its simplicity. There is no checksum in the packet, so corruption of the data cannot be detected. There is no version field in the packet, which may make it difficult to make subsequent changes to the packet structure. Finally, on networks with an MTU significantly greater than 1204 bytes, larger packets may be desirable.

Implementation

Block Diagram

The implementation moves much of the synchronisation daemon logic out of the kernel and into user-space, as it is believed that more sophisticated synchronisation daemons can be developed more rapidly in user-space than in kernel space. The main disadvantage of moving the synchronisation logic into user-space is a possible performance impact. In particular, there is some overhead in passing the synchronisation information to user-space, especially as the user-space daemon's multicast communication must pass back through the kernel. However, testing showed that the netlink socket used for this communication is very fast [Appendix A].

LVS Core

LVS's existing synchronisation code works by feeding synchronisation information to ipvs syncmaster, which in turn sends it out onto the network. ipvs syncslave listens for synchronisation packets arriving via multicast and feeds them into the LVS core. ipvs syncmaster and ipvs syncslave should be started on the linux directors using ipvsadm.

By default a connection is passed to ipvs syncmaster once it passes a threshold of 3 packets, and is then resent to ipvs syncmaster at a frequency of once every 50 packets after that. The threshold may be manipulated using the existing /proc/sys/net/ipv4/vs/sync_threshold proc entry. To allow the frequency to be manipulated, LVS has been patched to add the /proc/sys/net/ipv4/vs/sync_frequency proc entry. As synchronisation information for a given connection is resent every 50 packets, it is not considered a serious problem if synchronisation packets are occasionally dropped. This is important, as multicast UDP, which has no underlying retransmission mechanism for dropped packets, is used for communication between the linux directors.
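
A minimal stand-alone sketch of that decision, using illustrative names rather than the actual kernel code, is shown below.

#include <stdbool.h>

/* Illustrative model of the behaviour described above: a connection is
 * first reported to ipvs syncmaster once it has carried sync_threshold
 * packets, and again every sync_frequency packets after that. */
static bool should_sync(unsigned long packets,
                        unsigned long sync_threshold,
                        unsigned long sync_frequency)
{
        if (packets < sync_threshold)
                return false;
        return (packets - sync_threshold) % sync_frequency == 0;
}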

ipvs syncmaster

The LVS synchronisation code was patched to abstract the sending and receiving of synchronisation packets. This allows different methods of sending and receiving synchronisation packets to be used without further modification to this code. The following hooks are provided:

send_mesg Used by ipvs syncmaster to send a packet out onto the network.
open_send Used by ipvs syncmaster to open the socket used to send packets to the network.
close_send Used by ipvs syncmaster to close the socket used to send packets to the network.
recv_loop Event loop used by ipvs syncslave to receive packets from the network.
open_recv Used by ipvs syncslave to open the socket that is used to receive packets from the network.
close_recv Used by ipvs syncslave to close the socket that is used to receive packets from the network.

The following functions are provided to register functions for these hooks; a usage sketch is given after the list. It is envisaged that the hook functions would be implemented in a separate kernel module. When the module initialises itself, ip_vs_sync_table_register should be called. When the module unregisters itself, ip_vs_sync_table_register_default should be called.

ip_vs_sync_table_register Used to register the hooks above. If NULL is supplied for any of the hooks, then the hook will have no effect.
ip_vs_sync_table_register_default Used to register the default hooks, which give the existing behaviour of directly sending and receiving multicast packets. This behaviour is registered when LVS is initialised.
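
The sketch below shows how a synchronisation-method module might use these registration functions. The paper does not reproduce the hook-table type or the exact prototypes, so the struct layout and function signatures here are assumptions for illustration only; just the names ip_vs_sync_table_register and ip_vs_sync_table_register_default are taken from the text.

/* Hypothetical sketch of a synchronisation-method module; the real
 * hook-table type and prototypes live in the patched LVS source and
 * may differ from this. */
struct ip_vs_sync_table_sketch {
        int  (*send_mesg)(const char *mesg, int len);
        int  (*open_send)(void);
        void (*close_send)(void);
        int  (*recv_loop)(void *data);
        int  (*open_recv)(void);
        void (*close_recv)(void);
};

/* send_mesg hook: hand the synchronisation packet to user-space
 * instead of multicasting it directly. */
static int my_send_mesg(const char *mesg, int len)
{
        /* ... queue mesg on the netlink socket ... */
        return 0;
}

/* Receive-side hooks are left NULL: synchronisation information from
 * user-space is fed straight into the LVS core, so no sync-slave
 * event loop is needed. */
static struct ip_vs_sync_table_sketch my_sync_table = {
        .send_mesg = my_send_mesg,
};

static int my_sync_module_init(void)
{
        ip_vs_sync_table_register(&my_sync_table);    /* take over from the default */
        return 0;
}

static void my_sync_module_exit(void)
{
        ip_vs_sync_table_register_default();          /* restore multicast behaviour */
}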

The proc entry /proc/sys/net/ipv4/vs/sync_msg_max_size was added to allow the maximum size of messages sent by ipvs syncmaster to be modified. In situations where the linux director is under load, this will be the size of most synchronisation packets. The default is 1228 bytes, which reflects the old hard-coded value. The intention of this is to allow more connections to be synchronised in a single packet on networks with an MTU significantly larger than 1228 bytes, such as ethernet networks with 6000-byte jumbo frames, and netlink sockets, which have a maximum packet size of 64Kbytes.

ip_vs_user_sync

Using the hooks that were added to the LVS synchronisation code, a method that sends and receives synchronisation packets via a netlink socket was written. This was implemented as a separate kernel module, ip_vs_user_sync. When this module is inserted into the kernel it registers itself as the synchronisation method and starts the kernel synchronisation daemon ipvs syncmaster. The registered send_mesg hook passes synchronisation packets to user-space. The ip_vs_user_sync module listens for synchronisation information from user-space and passes it directly to the LVS core. Thus there is no need for an ipvs syncslave process, and ipvs syncmaster can be run on all linux directors. This is important as it allows a peer-to-peer relationship to be established between linux directors in place of a master/slave relationship.

The ipvs syncmaster kernel daemon is started when ip_vs_user_sync is initialised. Thus, unlike the existing LVS synchronisation code, this daemon should not be started using ipvsadm.

Netlink Socket

The user-space daemon communicates with its kernel counterpart using a netlink socket. This requires a small patch to the kernel to add the NETLINK_IPVS protocol. Communication using this socket is somewhat analogous to UDP: packets of up to 64Kbytes may be sent, and packets may be dropped in situations of high load, though unlike UDP this results in an error condition. Unlike UDP, the data in packets received can be assumed to be uncorrupted, unless there is faulty memory in the machine, which should result in other, more dire problems. Importantly, only local processes owned by root may communicate with the kernel using the netlink socket, so there is some level of implicit authentication of the user-space client.
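
As a rough illustration, a user-space client might open such a socket as sketched below. The NETLINK_IPVS protocol is added by the kernel patch mentioned above; the numeric value used here is a placeholder, and error handling is kept minimal.

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Placeholder: the real value is defined by the NETLINK_IPVS kernel
 * patch, not by this paper. */
#ifndef NETLINK_IPVS
#define NETLINK_IPVS 23
#endif

/* Open and bind a netlink socket for talking to ip_vs_user_sync.
 * Only root may open such a socket, which provides the implicit
 * authentication described above. */
static int open_ipvs_sync_socket(void)
{
        struct sockaddr_nl addr;
        int fd;

        fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_IPVS);
        if (fd < 0)
                return -1;

        memset(&addr, 0, sizeof(addr));
        addr.nl_family = AF_NETLINK;
        addr.nl_pid = getpid();         /* local address of this process */

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                close(fd);
                return -1;
        }

        return fd;
}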

Testing of the netlink socket, as shown in Appendix A, shows that there is a sweet spot for throughput at a packet size of around 7200 bytes. For this reason it is suggested that the maximum size of synchronisation packets sent by ipvs syncmaster be set to this value using /proc/sys/net/ipv4/vs/sync_msg_max_size.

libip_vs_user_sync

This is a library that provides a convenient way for user-space applications to communicate with the kernel using the NETLINK_IPVS netlink socket. It provides calls analogous to libc's send() and recv(), as well as defining a simple packet structure which is also used by ip_vs_user_sync.

ip_vs_user_sync_simple

A user-space synchronisation daemon. It uses libip_vs_user_sync to access a netlink socket to listen for synchronisation information from the kernel. It then sends this information to other linux directors using multicast. The packet format used to send these packets has a version field to allow packets from a different version of the protocol to easily be identified and dropped. All nodes running this daemon can send and receive this synchronisation information. Thus there is no sync-master/sync-slave relationship.
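
The exact packet format used between the daemons is not reproduced in this paper; a minimal header carrying such a version field might look like the following sketch, with illustrative field names.

#include <stdint.h>

/* Illustrative only: a header for inter-director synchronisation
 * packets carrying a protocol version, so that packets from a
 * different version of the protocol can be identified and dropped
 * by the receiver. */
struct user_sync_header_sketch {
        uint8_t  version;       /* protocol version of the sender */
        uint8_t  reserved;
        uint16_t size;          /* total size of the packet */
        /* connection entries follow */
};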

This is intended as a bare-bones synchronisation daemon that illustrates how libip_vs_user_sync can be used in conjunction with ip_vs_user_sync to create a user-space synchronisation daemon. Its key advantage over the existing LVS synchronisation code is that it eliminates the sync-master/sync-slave relationship and the problems that this can introduce. It is suitable for use both in active/stand-by configurations with two linux directors and in active-active configurations with any number of linux directors.

This daemon, like the existing LVS synchronisation code, has no security protection. Thus interfaces listening for synchronisation packets over multicast should be protected from packets from untrusted hosts. This can be done by filtering at the gateway to the network, or using a private network for synchronisation traffic. It is not sufficient to use packet filtering on an exposed interface, as it is trivial for a would-be attacker to spoof the source address of a packet.

Sample Configuration

Sample Topology

The topology has two linux directors, in an active/stand-by configuration. There is a Virtual IP address (VIP) on the external network which is the IP address that end-users connect to and should be advertised in DNS. There is also a VIP on the internal network. This is used as the default route for the real servers. The VIPs are administered by Heartbeat [11] so that they belong to the active linux director. ip_vs_user_sync_simple runs on both of the linux directors to synchronise connection information between them. NAT (Network Address Translation) is used to forward packets to the two real servers. More real servers may be added to the network if additional capacity is required.

Given the flexibility of LVS, there are many different ways of configuring load balancing using LVS; this is one of them. Details on configuring LVS can be found in the LVS HOWTO[12] and several sample topologies can be found in the Ultra Monkey documentation[13]. The information in these documents, combined with this paper, should be sufficient to configure load balancing with connection synchronisation for a wide range of network topologies.

Network Preparation

The documentation that follows assumes that all nodes on the network are set up with correct interfaces and routes for each network they are connected to, as per the diagram above. The return path for packets must be through the active linux director. In most cases this will mean that the default route should be set to the VIP.

Linux Directors

Packet Forwarding

The linux directors must be able to route traffic from the external network to the server network and vice versa. Specifically, in addition to correctly configuring the interfaces and routes, you must enable IPv4 forwarding. This is done by modifying the line containing net.ipv4.ip_forward in /etc/sysctl.conf as follows. Alternatively, the corresponding /proc value may be manipulated directly.

net.ipv4.ip_forward = 1
For this change to take effect run: sysctl -p

Heartbeat

Heartbeat runs on the two linux directors and handles bringing up the interface for the VIPs. To configure heartbeat, /etc/ha.d/ha.cf, /etc/ha.d/haresources and /etc/ha.d/authkeys must be installed. The node names in /etc/ha.d/ha.cf and /etc/ha.d/haresources must be set according to the output of the uname -n command on each linux director. The key in /etc/ha.d/authkeys should be modified to something confidential to the site. It is highly recommended that heartbeat be run over at least two links to avoid a single link failure resulting in a fail-over. More information on configuring these files can be found in the documentation and sample configuration files supplied with Heartbeat.

To start heartbeat run: /etc/init.d/heartbeat start

After a few moments heartbeat should bring up an IP alias for the VIP on the master linux director. This can be verified using the ifconfig command. The output of the following command has been truncated to show only the eth0:0 and eth1:0 interfaces. Depending on the setup of the host it is possible that heartbeat will use different interfaces.

/sbin/ifconfig
eth0:0    Link encap:Ethernet  HWaddr 00:D0:B7:BE:6B:CF  
          inet addr:192.168.6.240  Bcast:192.168.6.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:17 Base address:0xef00 

eth1:0    Link encap:Ethernet  HWaddr 00:90:27:74:84:ED  
          inet addr:192.168.7.240  Bcast:192.168.7.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:18 Base address:0xee80 

Ldirectord

The monitoring of real servers, and their insertion and removal from the pool of available servers, is controlled by ldirectord. To configure ldirectord, /etc/ha.d/ldirectord.cf must be installed. Information on customising this file can be found in the ldirectord(8) man page and in the sample configuration supplied with ldirectord.

To start ldirectord run: /etc/init.d/ldirectord start

Ldirectord should initialise the current LVS kernel table. To inspect this use ipvsadm. For example:

/sbin/ipvsadm -L -n
IP Virtual Server version 0.9.16 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  192.168.7.240:443 rr
  -> 192.168.6.4:443            Masq    1      0          0         
  -> 192.168.6.5:443            Masq    1      0          0         

Netfilter

As NAT is being used to forward packets to the real servers, netfilter on the linux directors should be configured accordingly:

# Flush existing rules
/sbin/iptables -F

# NAT for 192.168.6.0/24 bound for any host
/sbin/iptables -t nat -A POSTROUTING -j MASQUERADE -s 192.168.6.0/24

# Log all packets that attempt to be forwarded
# Useful for Debugging. Questionable for Production
#/sbin/iptables -t nat -A POSTROUTING -j LOG

# View the rules
# Truncated to show only the POSTROUTING chain in the nat table
/etc/init.d/iptables status
Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  192.168.6.0/24       anywhere
If an FTP Virtual Service is to be used then the ip_vs_ftp kernel module needs to be loaded:
/sbin/modprobe ip_vs_ftp

Synchronisation Daemon

The synchronisation daemon ip_vs_user_sync_simple is configured by /etc/ip_vs_user_sync_simple.conf.

To start ip_vs_user_sync_simple run: /etc/init.d/ip_vs_user_sync_simple start

Optionally, the communication overhead between the kernel and user-space synchronisation daemons can be slightly reduced by increasing the maximum size of packets sent by the kernel daemon from 1228 bytes to 7200 bytes. To do this, add the following line to /etc/sysctl.conf, or manipulate the corresponding /proc entry directly.

net.ipv4.vs.sync_msg_max_size = 7200
For this change to take effect run: sysctl -p

Real Servers

The real servers should be configured to run the underlying services for their respective virtual services. For instance, an HTTP daemon. In addition, the "request" URLs as specified in /etc/ha.d/ldirectord.cf should be present and contain the "receive" string. The real servers also need to be set up so that their default route is set to the VIP on the server network.

Result

When an end-user makes a connection to the external VIP, 192.168.7.240, this should be received by the active linux director. The linux director will then allocate the connection to one of the real servers and forward packets to that real server for the life of the connection. It will also synchronise the connection to the other director using the synchronisation daemon. If the active linux director fails, the stand-by linux director should assume the VIPs; as this director has received information about the active connections from the connection synchronisation daemon, those connections are able to continue.

Conclusion

Connection synchronisation between linux directors allows connections to continue after a linux director fail-over occurs, whether the active linux director fails or is taken down for maintenance.

The method implemented establishes a peer-to-peer relationship between the linux directors using multicast. This allows connections being handled by any linux director to be efficiently synchronised to all other available directors. Thus, no matter which director is active, if it becomes unavailable and at least one other director is available, then fail-over should occur and active connections should continue.

The implementation modifies the existing LVS synchronisation code to allow different synchronisation methods to be registered. The method registered forwards synchronisation information to a user-space daemon, where it is processed and distributed over multicast. Thus, much of the synchronisation logic has been moved out of the kernel and into user-space, allowing a more sophisticated daemon to be built using existing user-space libraries and debugging tools.

Appendix

Appendix A: Netlink Socket Throughput

Netlink socket communication between user-space and the kernel is quite fast, as shown in the charts below. The first chart plots transfer speed against packet size for packets from 0 to 16Kbytes. The second chart plots the same data for packets from 0 to 64Kbytes, the maximum packet size for netlink sockets.

The graphs show that on a Pentium III 800MHz, transfer rates in excess of 660000Kbytes/s (5.0Gbits/s) are attainable for packets over 3100 bytes in size. The graphs also indicate that there is a sweet spot at 7200 bytes, where a transfer rate of 920000Kbytes/s (7.2Gbits/s) is obtained.

Bibliography

[1] Wensong Zhang et al. 1998-2002. "Linux Virtual Server Project". http://www.linuxvirtualserver.org/.

[2] F5 Networks, Inc. 1999-2002. "F5 Networks". http://www.f5.com/.

[3] F5 Networks. 2002. "F5 Networks". http://www.f5networks.co.jp/.

[4] Foundry Networks, Inc. 2002. "Foundry Networks". http://www.foundrynetworks.com/.

[5] Foundry Networks. 2002. "Foundry Networks". http://www.foundrynetworks.co.jp/.

[6] Simon Horman. 2000. "Creating Linux Web Farms". http://www.vergenet.net/linux/.

[7] OSDN. 1997-2002. "Slashdot.Org". http://slashdot.org/.

[8] Internet Initiative Japan Inc. 1996. "Internet Initiative Japan (IIJ)". http://www.iij.ad.jp/.

[9] Dr. Ing. h.c. F. Porsche AG. 2002. "Porsche.Com". http://www.porsche.com/.

[10] Simon Horman. 2002. "Saru: Active-Active Load Balancing". http://www.ultramonkey.org/.

[11] Alan Robertson et al. 1999-2002. "Heartbeat". http://www.linux-ha.org/heartbeat/.

[12] Joseph Mack. 1999-2003. "LVS-HOWTO". http://www.linuxvirtualserver.org/.

[13] Simon Horman. 2002. "Ultra Monkey - High Availability and Load Balancing Solution for Linux", 2.0.0. http://www.ultramonkey.org/.