Connection Synchronisation (TCP Fail-Over)

Horms (Simon Horman) - horms@valinux.co.jp
VA Linux Systems Japan, K.K. - www.valinux.co.jp
with assistance from NTT Comware Corporation - www.nttcom.co.jp
November 2002, Revised November 2003

Abstract

Load Balancing using Layer 4 Switching is a very popular method of creating large Internet sites using commodity hardware. To avoid having a single point of failure two directors are often deployed in an active/stand-by configuration to load balance a site. However, when fail-over occurs, active connections will break unless they are synchronised between the two directors. This paper examines how the current connection synchronisation code works in the Linux Virtual Server and looks at improvements that have been made to this code-base.

Load Balancing using Layer 4 Switching, as provided by The Linux Virtual Server Project (LVS)[1], F5 Networks[2][3], Foundry Networks[4][5] and others, is a powerful tool that allows networked services to scale beyond a single machine[6]. This is the prevailing technology for building large web sites from small, commodity servers. Many sites use this technology in one form or another. Examples include Slashdot.Org[7] (Foundry), Internet Initiative Japan (IIJ)[8] (F5) and Porsche.Com[9] (LVS).
Layer 4 Switching can, however, introduce a single point of failure into a system - all traffic must go through the linux director when it enters the network, and often when it leaves the network. Thus if the linux director fails then the load balanced site becomes unavailable. To alleviate this problem it is common to have two linux directors in an active/stand-by configuration.
Figure 1: One Director, Single Point of Failure

Figure 2: Active/Stand-By Directors
In this scenario the active linux director accepts packets for the service and forwards them to the real servers. The stand-by linux director is idle. If the active linux director fails or is taken off-line for maintenance, then the stand-by becomes the active linux director. However, when such a fail-over occurs, any existing connections will be broken.
Connection Synchronisation resolves this problem by synchronising information about active connections to the stand-by linux director. Thus, when a fail-over occurs, the new active linux director is able to load balance subsequent packets for these connections.
This project allows connections to a virtual service to continue so long as
one linux director is online and can be made the active
linux director. The design is such that regardless of which
linux director is active or how many times a fail-over occurs, the
connection is able to continue. While the focus of development has been
on two linux directors in an active/stand-by configuration, the
design will work with any number of linux directors. It also allows
for the possibility that more than one linux director may be active[10].

LVS Overview

When a linux director receives a packet for a new connection it
allocates the connection to a real server. This allocation is effected
by allocating an ip_vs_conn structure in the kernel which stores the source
address and port, the address and port of the virtual service, and the
real server address and port of the connection. Each time a subsequent
packet for this connection is received, this structure is looked up and the
packet is forwarded accordingly. This structure is also used to reverse the
translation process which occurs when NAT (Network Address Translation) is
used to forward a packet and to store persistence information. Persistence
is a feature of LVS whereby subsequent connections from the same end-user
are forwarded to the same real server.

When fail-over occurs, the new active linux director does not have the
ip_vs_conn structures for the active connections. So when a packet is
received for one of these connections, the linux director does not know
which real server to send it to. Thus, the connection breaks and the
end-user needs to reconnect. By synchronising the ip_vs_conn structures
between the linux directors this situation can be avoided, and connections
can continue after a linux director fails-over.

Master/Slave Problem

The existing LVS connection code relies on a sync-master/sync-slave setup
where the sync-master sends synchronisation information and the sync-slaves
listen. This means that if a sync-slave linux director becomes the
active linux director, then connections made will not be synchronised.
This is illustrated by the following scenario.

There are two linux directors, director-a and director-b. Director-A is the LVS sync-master and the active linux director. Director-B is an LVS sync-slave and the stand-by linux director.

Patches are available for LVS which allow a linux director to run as a sync-master and sync-slave simultaneously. This overcomes the above problem. The patches have been integrated into the LVS code for the Linux 2.6 kernel since LVS-1.1.6. They have not been integrated into the LVS code for the Linux 2.4 kernel as of LVS-1.0.10 but the patches can be found on the lvs-users' mailing list[1].
As a different approach to this problem, a peer-to-peer synchronisation
system has been developed. In the system developed each linux director
sends synchronisation information for connections that it is handling to
all other linux directors. Thus, in the scenario above connections
would be synchronised from director-b to director-a and connection-2 would
be able to continue after the second fail-over when director-a becomes the
active director again.
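
The difference between the two approaches can be illustrated with a toy simulation. This is only a sketch: the Director class, the connection names and the real server names are invented for illustration, and the sequence of events (connection-1 made on director-a, a fail-over to director-b, connection-2 made on director-b, a second fail-over back to director-a) is inferred from the description above rather than taken from the LVS code.

```python
"""Toy model of LVS connection synchronisation (names are illustrative only)."""


class Director:
    def __init__(self, name):
        self.name = name
        self.connections = {}   # connection id -> real server
        self.peers = []         # directors this one sends sync data to

    def accept(self, conn_id, real_server):
        """Allocate a new connection and push it to every sync peer."""
        self.connections[conn_id] = real_server
        for peer in self.peers:
            peer.connections[conn_id] = real_server

    def can_forward(self, conn_id):
        """A director can only continue connections it knows about."""
        return conn_id in self.connections


def run_scenario(peer_to_peer):
    a, b = Director("director-a"), Director("director-b")
    a.peers = [b]                       # director-a is the sync-master
    if peer_to_peer:
        b.peers = [a]                   # every director also sends
    a.accept("connection-1", "real-1")  # director-a is active
    # First fail-over: director-b becomes active.
    b.accept("connection-2", "real-2")
    # Second fail-over: director-a becomes active again.
    return a.can_forward("connection-1"), a.can_forward("connection-2")


print(run_scenario(peer_to_peer=False))  # (True, False)
print(run_scenario(peer_to_peer=True))   # (True, True)
```

With a single sync-master, connection-2 is unknown to director-a after the second fail-over. When both directors send, it survives.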

Protocol Deficiencies

LVS's existing connection synchronisation code uses a very simple
packet structure and protocol to transmit synchronisation information.
The packet starts with a 4 byte header followed by up to 50 connection
entries. The connection entries may be 24 or 48 bytes long.
The author has observed that in practice,
most synchronisation connection entries are 24 bytes,
not 48 bytes. Thus, the 50 connection entry limit results
in most packets having 1204 bytes of data.

There are several aspects of this design that can be improved without
impacting on its simplicity:

- There is no checksum in the packet. This makes it impossible to detect corruption of data.
- There is no version field in the packet. This may make it difficult to make subsequent changes to the packet structure.
- On networks with an MTU significantly greater than 1204 bytes, larger packets may be desirable.

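
As an illustration of how a version field and a checksum address the first two deficiencies, the following sketch packs connection entries behind a small header. The header layout, the VERSION value and the function names are hypothetical and are not the LVS wire format. Only the 24 byte entry size and the 50 entry limit are taken from the description above.

```python
import struct
import zlib

# Hypothetical layout, NOT the LVS wire format:
#   version (1 byte) | entry count (1 byte) | payload length (2 bytes) | CRC32 (4 bytes)
HEADER = struct.Struct("!BBHI")
VERSION = 1
ENTRY_SIZE = 24  # size of the common connection entry, from the text


def encode(entries):
    """Prefix the concatenated entries with a version and a checksum."""
    payload = b"".join(entries)
    return HEADER.pack(VERSION, len(entries), len(payload),
                       zlib.crc32(payload)) + payload


def decode(packet):
    """Return the list of entries, or raise ValueError for a bad packet."""
    if len(packet) < HEADER.size:
        raise ValueError("short packet")
    version, count, length, checksum = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:]
    if version != VERSION:
        raise ValueError("unsupported version")
    if len(payload) != length or zlib.crc32(payload) != checksum:
        raise ValueError("corrupt packet")
    if count * ENTRY_SIZE != length:
        raise ValueError("inconsistent entry count")
    return [payload[i:i + ENTRY_SIZE] for i in range(0, length, ENTRY_SIZE)]


def is_valid(packet):
    """True if the packet decodes cleanly."""
    try:
        decode(packet)
        return True
    except ValueError:
        return False


entries = [bytes([i]) * ENTRY_SIZE for i in range(50)]
packet = encode(entries)
damaged = packet[:-1] + bytes([packet[-1] ^ 0xFF])    # flip bits in the last byte
other_version = bytes([VERSION + 1]) + packet[1:]     # sender uses a newer format

print(len(packet))  # 8 byte header + 50 * 24 bytes of entries
print(is_valid(packet), is_valid(damaged), is_valid(other_version))
```

A receiver built this way can drop damaged packets and packets from an incompatible sender, at a cost of a few extra header bytes.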
Implementation
Figure: Block Diagram
LVS's existing synchronisation code works by feeding synchronisation information to ipvs syncmaster which in turn sends it out onto the network. ipvs syncslave listens for synchronisation packets to arrive via multicast and feeds them into the LVS core. ipvs syncmaster and ipvs syncslave should be started on the linux directors using ipvsadm.
By default a connection is passed to ipvs syncmaster once it passes a threshold of 3 packets, and then is resent to ipvs syncmaster at a frequency of once every 50 packets after that. The threshold may be manipulated using the existing /proc/sys/net/vs/sync_threshold. To allow the frequency to be manipulated, LVS has been patched to add the /proc/sys/net/vs/sync_frequency proc entry. As synchronisation information for a given connection is resent every 50 packets it is not considered of particular importance if occasionally the synchronisation packets are dropped. This is important, as multicast UDP, which does not have an underlying retransmission method for dropped packets, is used for inter-linux director communication.
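
The threshold and frequency rule can be sketched as a small predicate. This follows one reading of the description above (first at the threshold, then once every frequency packets). The function name is invented and the exact test used by the kernel code may differ.

```python
def should_sync(packet_count, threshold=3, frequency=50):
    """Return True if the connection should be handed to ipvs syncmaster
    after its packet_count-th packet.

    Assumed rule: first at the threshold, then once every `frequency` packets.
    """
    return (packet_count >= threshold
            and (packet_count - threshold) % frequency == 0)


sync_points = [n for n in range(1, 160) if should_sync(n)]
print(sync_points)  # [3, 53, 103, 153]
```

Because every long-lived connection is resent periodically, a stand-by that misses one synchronisation packet learns about the connection again within the next 50 packets.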
| send_mesg | Used by ipvs syncmaster to send a packet out onto the network. |
| open_send | Used by ipvs syncmaster to open the socket used to send packets to the network. |
| close_send | Used by ipvs syncmaster to close the socket used to send packets to the network. |
| recv_loop | Event loop used by ipvs syncslave to receive packets from the network. |
| open_recv | Used by ipvs syncslave to open the socket that is used to receive packets from the network. |
| close_recv | Used by ipvs syncslave to close the socket that is used to receive packets from the network. |
The following functions are provided to register functions for these hooks. It is envisaged that the hook functions would be implemented in a separate kernel module. When the module initialises itself ip_vs_sync_table_register should be called. When the module is unregistering itself, ip_vs_sync_table_register_default should be called.
| ip_vs_sync_table_register | Used to register the hooks above. If NULL is supplied for any of the hooks, then the hook will have no effect. |
| ip_vs_sync_table_register_default | Used to register the default hooks, which gives the existing behaviour of directly sending and receiving multicast packets. This behaviour is registered when LVS is initialised. |
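
The registration scheme can be modelled in user-space as a table of callables. The sketch below is a Python analogy only: the SyncTable class and its methods are invented for illustration and do not reflect the signatures of the kernel functions. The hook names and the rule that a NULL (here None) hook has no effect come from the tables above.

```python
HOOK_NAMES = ("send_mesg", "open_send", "close_send",
              "recv_loop", "open_recv", "close_recv")


def _no_op(*args, **kwargs):
    """Stand-in for a hook that was registered as None."""
    return None


class SyncTable:
    def __init__(self, default_hooks):
        self._defaults = dict(default_hooks)
        self._hooks = {}
        self.register_default()

    def register(self, **hooks):
        """Install a full set of hooks; a missing or None hook does nothing."""
        unknown = set(hooks) - set(HOOK_NAMES)
        if unknown:
            raise KeyError(f"unknown hooks: {sorted(unknown)}")
        self._hooks = {name: hooks.get(name) or _no_op for name in HOOK_NAMES}

    def register_default(self):
        """Restore the behaviour that was in place at initialisation."""
        self.register(**self._defaults)

    def call(self, name, *args):
        return self._hooks[name](*args)


sent = []
table = SyncTable({"send_mesg": lambda data: sent.append(("multicast", data))})
table.call("send_mesg", b"conn-1")      # default behaviour

table.register(send_mesg=lambda data: sent.append(("netlink", data)))
table.call("send_mesg", b"conn-2")      # replacement module loaded
table.call("open_recv")                 # hook not supplied: no effect

table.register_default()                # replacement module unloaded
table.call("send_mesg", b"conn-3")
print(sent)
```

Restoring the defaults when the replacement is removed mirrors the advice that a module should call ip_vs_sync_table_register_default when it unregisters itself.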
The proc entry /proc/net/vs/sync_msg_max_size was added to allow
the maximum size of messages sent by ipvs syncmaster to be
modified. In situations where the linux director is under load, this
will be the size of most synchronisation packets. The default is 1228 bytes
which reflects the old hard-coded value. The intention of this is to allow
more connections to be synchronised in a single packet on networks with an
MTU significantly larger than 1228 bytes, such as ethernet networks with
jumbo 6000 byte frames and netlink sockets which have an MTU of 64Kbytes.
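
The effect of a larger maximum message size can be estimated with simple arithmetic. The sketch below assumes the 4 byte header and 24 byte entries described earlier and assumes that no separate limit on the number of entries applies. The function name is invented for illustration.

```python
HEADER_SIZE = 4   # bytes, from the packet description in the text
ENTRY_SIZE = 24   # bytes, the common connection entry size


def entries_per_message(max_size, header_size=HEADER_SIZE, entry_size=ENTRY_SIZE):
    """Number of whole connection entries that fit in one message."""
    if max_size < header_size:
        return 0
    return (max_size - header_size) // entry_size


for size in (1204, 1228, 6000, 7200):
    print(size, entries_per_message(size))
```

Under these assumptions a 1204 byte message carries 50 entries, while a 7200 byte message carries 299, so far fewer messages are needed for the same number of connections.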

ip_vs_user_sync

Using the hooks that were added to the LVS synchronisation code, a method
that sends and receives synchronisation packets via a netlink socket was
written. This was implemented as a separate kernel module,
ip_vs_user_sync. When this module is inserted into the kernel it
registers itself as the synchronisation method and starts the kernel
synchronisation daemon ipvs syncmaster. The
send_mesg hook registered passes synchronisation packets to
user-space. The ip_vs_user_sync module listens for synchronisation
information from user-space and passes it directly to the LVS core. Thus
there is no need for an ipvs syncslave process and
ipvs syncmaster can be run on all linux directors. This
is important to allow a peer-to-peer relationship to be established between
linux directors in place of a master/slave relationship.

The ipvs syncmaster kernel daemon is started when
ip_vs_user_sync is initialised. Thus, unlike the existing LVS
synchronisation code, this daemon should not be started using ipvsadm.

Netlink Socket

The user-space daemon communicates with its kernel counterpart using a netlink
socket. This requires a small patch to the kernel to add the NETLINK_IPVS
protocol. Communication using this socket is somewhat analogous to UDP:
Packets of up to 64Kbytes may be sent. Packets may be dropped in situations
of high load, though unlike UDP this results in an error condition. Unlike
UDP, the data in packets received can be assumed to be uncorrupted, unless
there is broken memory in the machine, which should result in other, more
dire problems. Importantly, only local processes owned by root may
communicate with the kernel using the netlink socket. Thus there is some
level of implicit authenticity of the user-space client.

Testing of the netlink socket, as shown in Appendix A, indicates that there
is a sweet spot for throughput at a packet size of around 7200 bytes.
For this reason it is suggested that the maximum size of
synchronisation packets sent by ipvs syncmaster be set to
this value using /proc/net/vs/sync_msg_max_size.

libip_vs_user_sync

This is a library that provides a convenient way for user-space
applications to communicate with the kernel using the NETLINK_IPVS netlink
socket. This provides calls analogous to libc's send() and recv(), as well
as defining a simple packet structure which is also used by
ip_vs_user_sync.

ip_vs_user_sync_simple

A user-space synchronisation daemon. It uses libip_vs_user_sync to
access a netlink socket to listen for synchronisation information from the
kernel. It then sends this information to other linux directors using
multicast. The packet format used to send these packets has a version field
to allow packets from a different version of the protocol to easily be
identified and dropped. All nodes running this daemon can send and receive
this synchronisation information. Thus there is no sync-master/sync-slave
relationship.

This is intended as a bare-bones synchronisation daemon that illustrates how
libip_vs_user_sync can be used in conjunction with ip_vs_user_sync
to create a user-space synchronisation daemon. Its key advantage over the
existing LVS synchronisation code is that it eliminates the
sync-master/sync-slave relationship and the problems that it can introduce.
It is suitable for use in both active/stand-by configurations with two
linux directors and active-active configurations with any number of
linux directors.

This daemon, like the existing LVS synchronisation code, has no security
protection. Thus interfaces listening for synchronisation packets over
multicast should be protected from packets from untrusted hosts. This can
be done by filtering at the gateway to the network, or using a private
network for synchronisation traffic. It is not sufficient to use packet
filtering on an exposed interface, as it is trivial for a would-be attacker
to spoof the source address of a packet.

Sample Configuration
Figure: Sample Topology
The topology has two linux directors, in an active/stand-by configuration. There is a Virtual IP address (VIP) on the external network which is the IP address that end-users connect to and should be advertised in DNS. There is also a VIP on the internal network. This is used as the default route for the real servers. The VIPs are administered by Heartbeat [11] so that they belong to the active linux director. ip_vs_user_sync_simple runs on both of the linux directors to synchronise connection information between them. NAT (Network Address Translation) is used to forward packets to the two real servers. More real servers may be added to the network if additional capacity is required.
Given the flexibility of LVS, there are many different ways of configuring
load balancing using LVS. This is one of them. Details on configuring LVS
can be found in the LVS HOWTO[12] and several sample
topologies can be found in the Ultra Monkey documentation[13]. The information in these documents, combined with
this paper, should be sufficient to configure load balancing with connection
synchronisation for a wide range of network topologies.

Network Preparation

The documentation that follows assumes that all nodes on the network are
set up with correct interfaces and routes for each network they are
connected to as per the diagram above. The return path for packets must be
through the active linux director. In most cases this will mean that
the default route should be set to the VIP.

Linux Directors

Packet Forwarding

The linux directors must be able to route traffic from the external
network to the server network and vice versa. Specifically, in addition to
correctly configuring the interfaces and routes you must enable IPV4
forwarding. This is done by modifying the line containing
net.ipv4.ip_forward in /etc/sysctl.conf as follows.
Alternatively the corresponding /proc value may be manipulated
directly.

    net.ipv4.ip_forward = 1

For this change to take effect run: sysctl -p

Heartbeat

Heartbeat runs on the two linux directors and handles bringing up the
interface for the VIPs. To configure heartbeat,
/etc/ha.d/ha.cf, /etc/ha.d/haresources and
/etc/ha.d/authkeys must be installed. The node names in
/etc/ha.d/ha.cf and /etc/ha.d/haresources must be set
according to the output of the uname -n command on each
linux director. The key in /etc/ha.d/authkeys should be
modified to something confidential to the site. It is highly recommended
that heartbeat be run over at least two links to avoid a single
link-failure resulting in a fail-over. More information on configuring
these files can be found in the documentation and sample configuration files
supplied with Heartbeat.

To start heartbeat run: /etc/init.d/heartbeat start

After a few moments heartbeat should bring up an IP alias for the VIP
on the first master linux director. This can be verified using the
ifconfig command. The output of the following command has
been truncated to only show the eth0:0 and eth1:0
interfaces. Depending on the setup of the host it is possible that
heartbeat will use different interfaces.

    /sbin/ifconfig
    eth0:0    Link encap:Ethernet  HWaddr 00:D0:B7:BE:6B:CF
              inet addr:192.168.6.240  Bcast:192.168.6.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              Interrupt:17 Base address:0xef00

    eth1:0    Link encap:Ethernet  HWaddr 00:90:27:74:84:ED
              inet addr:192.168.7.240  Bcast:192.168.7.255  Mask:255.255.255.0
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              Interrupt:18 Base address:0xee80

Ldirectord

The monitoring of real servers, and their insertion and
removal from the pool of servers available is controlled
by ldirectord. To configure ldirectord, /etc/ha.d/ldirectord.cf
must be installed. Information on customising this file can
be found in the ldirectord(8) man page and in
the sample configuration supplied with ldirectord.

To start ldirectord run: /etc/init.d/ldirectord start

Ldirectord should initialise the current LVS kernel table.
To inspect this use ipvsadm. For example:

    /sbin/ipvsadm -L -n
    IP Virtual Server version 0.9.16 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
    TCP  192.168.7.240:443 rr
      -> 192.168.6.4:443              Masq    1      0          0
      -> 192.168.6.5:443              Masq    1      0          0

Netfilter

As NAT is being used to forward packets to the real servers,
netfilter on the linux directors should be configured accordingly:

    # Flush existing rules
    /sbin/iptables -F

    # NAT for 192.168.6.0/24 bound for any host
    /sbin/iptables -t nat -A POSTROUTING -j MASQUERADE -s 192.168.6.0/24

    # Log all packets that attempt to be forwarded
    # Useful for Debugging. Questionable for Production
    #/sbin/iptables -t nat -A POSTROUTING -j LOG

    # View the rules
    # Truncated to show only the POSTROUTING chain in the nat table
    /etc/init.d/iptables status
    Chain POSTROUTING (policy ACCEPT)
    target     prot opt source               destination
    MASQUERADE  all  --  192.168.6.0/24       anywhere

If an FTP Virtual Service is to be used then the ip_vs_ftp
kernel module needs to be loaded:

    /sbin/modprobe ip_vs_ftp

Synchronisation Daemon

The synchronisation daemon ip_vs_user_sync_simple is
configured by /etc/ip_vs_user_sync_simple.conf.

To start ip_vs_user_sync_simple run: /etc/init.d/ip_vs_user_sync_simple

Optionally, the communication overhead between the kernel and
user-space synchronisation daemons can be slightly reduced by increasing
the maximum packet size for packets sent by the kernel daemon from
1228 bytes to 7200 bytes. To do this, add the following line to
/etc/sysctl.conf or manipulate the corresponding /proc
entry directly.

    net.ipv4.vs.sync_msg_max_size = 7200

For this change to take effect run: sysctl -p

Real Servers

The real servers should be configured to run the underlying services
for their respective virtual services, for instance an HTTP daemon. In
addition, the "request" URLs as specified in
/etc/ha.d/ldirectord.cf should be present and contain the
"receive" string. The real servers also need to be set up so that
their default route is set to the VIP on the server network.

Result

When an end-user makes a connection to the external VIP, 192.168.7.240, this
should be received by the active linux director. The
linux director will then allocate the connection to one of the
real servers and forward packets to the real server for the life
of the connection. It will also synchronise the connection to the other
director using the synchronisation daemon. If the active
linux director fails then the stand-by linux director should
assume the VIP. As this director has information about all the active
connections from the connection synchronisation daemon, any active
connections will be able to continue.

Conclusion

Connection synchronisation between linux directors allows connections
to continue after a linux director fail-over occurs. This can occur
either if the active linux director fails or is taken down for
maintenance.

The method implemented establishes a peer-to-peer relationship between the
linux directors using multicast. This allows connections being handled
by any linux director to be efficiently synchronised to all other
available directors. Thus, no matter which director is active, if it becomes
unavailable and at least one other director is available, then fail-over
should occur and active connections should continue.

The implementation modifies the existing LVS synchronisation code to allow
different synchronisation methods to be registered. The method registered
forwards synchronisation information to a user-space daemon, where it is
processed and distributed over multicast. Thus, much of the
synchronisation logic was moved out of the kernel and into user-space,
allowing a more sophisticated daemon to be built using existing user-space
libraries and debugging tools.

Appendix

Appendix A: Netlink Socket Throughput

The performance of netlink socket communication between user-space and the
kernel is quite fast as shown in the charts below. The first chart plots
transfer speed against packet size for packets from 0 to 16Kbytes. The
second chart plots the same data for packets from 0 to 64Kbytes, the
maximum packet size for netlink sockets.

The graphs show that on a Pentium III 800MHz transfer rates in
excess of 660000Kbytes/s (5.0Gbits/s) are attainable for packets over
3100 bytes in size. The graphs also indicate that there is a sweet spot at
7200 bytes. At this point a transfer rate of 920000Kbytes/s (7.2Gbits/s) is
obtained.

[1] Wensong Zhang et al. 1998-2002. "Linux Virtual Server Project".
http://www.linuxvirtualserver.org/.
[2] F5 Networks, Inc. 1999-2002.
"F5 Networks".
http://www.f5.com/.
[3] F5 Networks. 2002. "F5 Networks".
http://www.f5networks.co.jp/.
[4] Foundry Networks, Inc. 2002.
"Foundry Networks".
http://www.foundrynetworks.com/.
[5] Foundry Networks. 2002.
"Foundry Networks".
http://www.foundrynetworks.co.jp/.
[6] Simon Horman. 2000.
"Creating Linux Web Farms".
http://www.vergenet.net/linux/.
[7] OSDN. 1997-2002.
"Slashdot.Org".
http://slashdot.org/.
[8] Internet Initiative Japan Inc. 1996.
"Internet Initiative Japan (IIJ)".
http://www.iij.ad.jp/.
[9] Dr. Ing. h.c. F. Porsche AG. 2002.
"Porsche.Com".
http://www.porsche.com/.
[10] Simon Horman. 2002.
"Saru: Active-Active Load Balancing".
http://www.ultramonkey.org/.
[11] Alan Robertson et al. 1999-2002.
"Heartbeat".
http://www.linux-ha.org/heartbeat/.
[12] Joseph Mack. 1999-2003.
"LVS-HOWTO".
http://www.linuxvirtualserver.org/.
[13] Simon Horman. 2002.
"Ultra Monkey - High Availability and Load Balancing Solution for Linux", 2.0.0.
http://www.ultramonkey.org/.