Horms (Simon Horman) - horms@valinux.co.jp
/sa-ru/ (n): [1] Monkey in Japanese.
[2] Active-Active Support for Ultra Monkey.
Load Balancing using Layer 4 Switching, as provided by The Linux Virtual Server Project (LVS)[1], F5 Networks[2][3], Foundry Networks[4][5] and others, is a powerful tool that allows networked services to scale beyond a single machine[6]. This is the prevailing technology for building large web sites from small, commodity servers. Many sites use this technology in one form or another. Examples include Slashdot.Org[7] (Foundry), Internet Initiative Japan (IIJ)[8] (F5) and Porsche.Com[9] (LVS).
Layer 4 Switching can, however, introduce a single point of failure into a system - all traffic must go through the linux director when it enters the network, and often when it leaves the network. Thus if the linux director fails then the load balanced site becomes unavailable. To alleviate this problem it is common to have two linux directors in an active/stand-by configuration.
Figure 1: One Linux Director, Single Point of Failure
Figure 2: Active/Stand-By Linux Directors
In this scenario the active linux director accepts packets for the service and forwards them to the real servers. The stand-by linux director is idle. If the active linux director fails or is taken off-line for maintenance, then the stand-by becomes the active linux director. If the linux directors do not fail or get taken down for maintenance very often, then the vast majority of the time one of the linux directors is idle. This is arguably a waste of resources.
While the real servers can be horizontally scaled by adding more real servers, there is no easy way to increase the capacity of the linux director beyond its hardware capabilities.
Active-Active provides horizontal scalability for the linux directors by having all of the available linux directors active. This allows the aggregate resources of the linux directors to be used to load balance traffic on the network, so that otherwise idle resources provide additional capacity.
The goal of this project is to allow between one and approximately sixteen load balancers to act as the linux director for a service simultaneously. One criterion is that adding linux directors should give a near-linear increase in the net load balancing capacity of the network. Given that a single linux director running LVS on the Linux 2.4 kernel can direct in excess of 700Mbits/s of traffic, the cost of obtaining the resources to test the combined load balancing capacity of a number of active-active linux directors is somewhat prohibitive. For this reason the implementation and capacity testing of this project will focus on one to three linux directors.
The design should be such that if one or more of the linux directors fails then load balancing can continue using the remaining linux directors. This should be the case both for connections that are initiated after a given linux director fails and, using connection synchronisation[10], for existing connections that were being load balanced by the failed linux director.
MAC addresses are 48-bit integer values[12], usually represented as six colon-delimited octets in hexadecimal. A MAC address is used by Ethernet hardware - usually NICs and switches - to address each other. When a host wants to send a packet to another host on the network it resolves the IP address of the host to its MAC address using The Address Resolution Protocol (ARP)[13]. The host then uses this MAC address to send the ethernet frame that encapsulates the packet to the host.
Figure 3: Resolving the MAC Address of a Node Using ARP
MAC addresses are preassigned to ethernet equipment by their manufacturer. Each manufacturer has a designated range of MAC addresses. Thus, in normal operation, each MAC address is present on a single host, as is each IP address in use on the network. When an ARP request is sent for the MAC address of an IP address that is present on the network, a single reply is received from the host that owns that IP address, with its MAC address.
Most ethernet hardware allows the MAC address to be changed, so it is possible to configure the NICs of multiple hosts to have the same MAC address. If multiple hosts have the same MAC and IP - this will be referred to as the common MAC and IP address respectively - then an ARP request for the common IP address will result in multiple ARP responses, all with the common MAC address. Although each ARP response will replace the previous one, they will all be the same so this is not important. Ethernet frames will be sent to the common MAC address. These frames should be received by all of the hosts on the network that have the common MAC address. Thus, the packets being sent will be received by all of the hosts with the common MAC address.
Figure 4: ARP on a Network with Hosts with a Common MAC and IP Address
Although packets are received by every host with the common MAC address, it is undesirable for them to be processed by more than one host - one connection should be terminated by one host. Hubert[11] suggests using static iptables[15] rules to drop traffic based on the source address. For example, if there are two linux directors, Director-A and Director-B, then Director-A should accept packets from hosts with an even IP address, and Director-B should accept packets from hosts with an odd IP address. This works well and can be scaled to any reasonable number of hosts by dividing up the source address space accordingly. However, this is not dynamic, and if one of the hosts fails then the end-users whose addresses are assigned to that host will not be able to connect.
Another problem with this approach is that it assumes that all source addresses are equal. However, it is quite reasonable to expect that some source addresses may make more requests than others. An example of where this may occur is a host that NATs or proxies a large network. Depending on who is accessing the service in question, this could result in an uneven distribution of load. Clearly, as Hubert himself suggests, there is room for improvement.
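The static scheme and its failure mode can be illustrated with a small Python model. This is only a sketch: it assumes the split is made on the integer value of the IPv4 source address modulo the number of linux directors, and the function names are hypothetical. It models the accept/drop decision, not the iptables rules themselves.

```python
import ipaddress

def static_owner(src_ip, n_directors):
    """Index of the director statically assigned to src_ip (assumed rule)."""
    return int(ipaddress.ip_address(src_ip)) % n_directors

def is_served(src_ip, n_directors, alive):
    """True if the director assigned to src_ip is in the set of live ones."""
    return static_owner(src_ip, n_directors) in alive

# Two directors: index 0 (Director-A) takes even addresses,
# index 1 (Director-B) takes odd addresses.
even_client = "10.0.0.4"
odd_client = "10.0.0.5"

# With both directors alive every client is served.
both_alive = {0, 1}
all_served = all(is_served(ip, 2, both_alive) for ip in (even_client, odd_client))

# If Director-B fails, nothing takes over its share of the address space.
only_a = {0}
odd_client_served = is_served(odd_client, 2, only_a)
```

With only Director-A alive, `odd_client_served` is false: the static rules have no way to hand the odd addresses to the surviving host, which is the limitation the dynamic filter described later is meant to remove.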
Giving NICs on different hosts the same MAC and IP address is a trivial
task that can be done using the ifconfig or ip
commands. Examples can be found in
Putting It Together: A Common MAC and IP Address.
In an unswitched network this should work fine as the ethernet is a
broadcast medium, so the NICs on every host on the network will see every
frame sent. The NICs on the hosts with the common MAC address will
dutifully accept packets addressed to them. However, in a switched
environment things are a little bit more difficult.
The switches used during testing stored
the MAC addresses being used to send packets by hosts connected
to each port. Subsequent packets sent to one of these
MAC addresses are only sent to the associated port. If each host has
a unique MAC address - as it typically does - then this works quite well.
But in a situation where a common MAC address is used by several
hosts the result is that only one host will receive packets sent
to the common MAC address. Clearly this is a problem in implementing
the design presented here.
A simple solution to this problem is to use a bogus MAC address
as the source when sending packets. This prevents the switch
from associating the common MAC address with any particular port.
When the switches used in testing did not know which port a MAC
address was associated with, they sent the frame to all ports. Thus all
of the hosts with the common MAC address receive the frame. This has
the slight disadvantage that the switch behaves much like a hub for
packets addressed to the common MAC address. But that is largely
unavoidable.
Common MAC and IP Address: outgoing_mac
Implementing this solution turned out to be a straightforward
patch to eth.c in the Linux kernel
to mangle the source MAC address. This behaviour can be
configured through the /proc
file system.
To globally enable this behaviour set
/proc/sys/net/ipv4/conf/all/outgoing_mac to 1.
Any non-zero value enables the feature; a value of 0 disables it.
The MAC address may be set on a per-interface basis
by modifying /proc/sys/net/ipv4/conf/<interface>/outgoing_mac.
A value of 0 disables the behaviour while any non-zero value
will be used as the MAC address for frames sent from the
corresponding interface.
Packets are always received using the MAC address of the NIC.
And if the outgoing_mac behaviour is disabled then
the address of the NIC is used to send packets. This
is the normal behaviour for an interface.
Common MAC and IP Address: Group MAC Addresses
If the first bit of a MAC address is zero, then this is an individual
address. If the first bit is 1, then it is a group address. Group
addresses are intended for use with multicast and broadcast. It was thought
that group addresses should be used for this project. However,
RFC1122[14] specifies various restrictions on how
packets with a multicast source or destination MAC address should be used
that make their use for unicast IP traffic impractical. More specifically:
| Section 3.2.2 | "An ICMP Message MUST NOT be sent as the result of receiving... ...a datagram sent as a link-layer broadcast..." |
| Section 3.3.6 | "When a host sends a datagram to a link-layer broadcast address, the IP destination address MUST be a legal IP broadcast or IP multicast address." "A host SHOULD silently discard a datagram that is received via a link-layer broadcast (see Section 2.4) but does not specify an IP multicast or broadcast destination address." |
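The individual/group distinction described above can be checked programmatically. In the sketch below, the "first bit" is taken to be the first bit transmitted on the wire, which for Ethernet is the least-significant bit of the first octet. The helper names `parse_mac` and `is_group_address` are hypothetical and are not part of the kernel patch.

```python
def parse_mac(text):
    """Parse six colon-delimited hexadecimal octets into 6 bytes."""
    octets = [int(part, 16) for part in text.split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError("expected six colon-delimited octets: %r" % (text,))
    return bytes(octets)

def is_group_address(mac):
    """True for multicast/broadcast (group) addresses.

    The individual/group bit is the least-significant bit of the
    first octet, which is the first bit sent on the wire.
    """
    return bool(mac[0] & 0x01)

common_mac = parse_mac("00:50:56:14:03:40")   # address used in the examples
broadcast = parse_mac("ff:ff:ff:ff:ff:ff")

common_is_group = is_group_address(common_mac)     # False: individual address
broadcast_is_group = is_group_address(broadcast)   # True: group address
```

The common MAC address used elsewhere in this paper is an individual address, which is consistent with the decision not to use group addresses.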
Simple filtering, such as that described by Hubert[11], can be set up using a simple iptables rule on each host. However, the static nature of this approach is highly undesirable. For this reason a method of dynamically filtering traffic is used.
The heart of the filtering is a netfilter[15] kernel module, ipt_saru. Most netfilter modules work such that the packets that they match are configured statically using iptables. The ipt_saru module is instead set as a netfilter match and initialised to a sane default setting by a module for iptables, libipt_saru. ipt_saru then allows the packets that it matches to be configured using a setsockopt call from user-space. A daemon, saru, monitors the status of other linux directors using heartbeat and sets the configuration of ipt_saru on the fly.
The block diagram shows how these components fit together. A detailed explanation follows.
The current filtering implementation is somewhat simple. It allows incoming connections to be allocated to a host based on their source or destination port or IP address.
It is thought by the author that filtering on the source port offers the most even load balancing. Connections for IP based services generally have an ephemeral port as the source port. While the exact selection of these ports varies between different operating systems, there is a large range of them, at least several thousand, often tens of thousands[16]. Given the size of this range, it seems reasonable to expect that connections from each end-user's machine will come in from many different ports.
However, such a simple scheme has its limitations and better methods
could be developed. This can be achieved by enhancing ipt_saru.
Filter: Blocking Strategies
The number of individual matches for the incoming packet filter is likely
to be quite large. In the case of using the source or destination port,
there are 2^16 (~65,000) possible ports. In the case of using the
source or destination IP address there are 2^32 (~4,300,000,000) possible
addresses for IPv4. More complex filtering criteria are likely to have even larger
numbers of possibilities. Clearly, it is not practical to have an
individual filtering rule for each possibility. To alleviate this problem,
a blocking scheme is used.
The result space is divided into a fixed number of blocks. For
example, if the possible ports are divided into 512 blocks, then each
block contains 128 ports: 0-127,128-255,...,65408-65535.
Figure 7: Port Blocks
For IP addresses the result space is divided up by using the modulus of the least-significant 16 bits, thus there are 2^16 sub-blocks of 32 addresses per block.
Figure 8: IP Address Blocks
Blocks are allocated to each linux director using a modulus - if there are two linux directors then each linux director gets every second block. Using this scheme, if some areas of the filter space are not used as heavily as others, each linux director should have some blocks in both the quieter and busier areas. An adaptive scheme that allocates blocks according to the relative load on each linux director may be developed in the future.
Conveniently these block allocations can be represented as a bit-map. For example, using ports again, if there are two linux directors and one is allocated the blocks 0-127,256-383,...,65280-65407 and the other the remaining blocks, then the bit-maps in hexadecimal would be 0xAA...A and 0x55...5 respectively.
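The modulus allocation and its bit-map representation can be sketched in Python as follows. This is an illustration, not the ipt_saru data structure: it assumes 512 blocks of 128 ports as in the example, and it assumes that block 0 maps to the most-significant bit of the first byte, an ordering chosen here because it reproduces the 0xAA...A and 0x55...5 values quoted above. The function names are hypothetical.

```python
N_BLOCKS = 512                         # number of blocks, as in the example
PORTS_PER_BLOCK = 65536 // N_BLOCKS    # 128 ports per block

def allocate_blocks(n_directors, n_blocks=N_BLOCKS):
    """Return one bit-map (bytes) per director, dealing blocks out by modulus."""
    bitmaps = [bytearray(n_blocks // 8) for _ in range(n_directors)]
    for block in range(n_blocks):
        owner = block % n_directors
        # Assumed bit order: block 0 is the top bit of byte 0.
        bitmaps[owner][block // 8] |= 0x80 >> (block % 8)
    return [bytes(b) for b in bitmaps]

def accepts(bitmap, port):
    """True if the director holding this bit-map should accept the port."""
    block = port // PORTS_PER_BLOCK
    return bool(bitmap[block // 8] & (0x80 >> (block % 8)))

first, second = allocate_blocks(2)
print(first.hex()[:8], second.hex()[:8])   # aaaaaaaa 55555555
```

With two directors, ports 0-127 are accepted only by the first director and ports 128-255 only by the second, and every port is accepted by exactly one of them.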
Although the examples above focus on ports and IP addresses, this should be
extensible to the results of any selection method whose result can be
expressed as an integer in a known range. Given that the result needs to
be generated for each incoming packet, it is envisaged that all selection
methods will be simple, iterative processes that generate an integral
result.
Filter: Linux Director Status
To allow blocks to be allocated to the linux directors that are
available it is important to know the state of each linux director and
to be able to allocate blocks accordingly. To do this a helper application,
saru, was written for Heartbeat[17].
Heartbeat implements a heartbeat protocol. That is, messages are sent at
regular intervals between hosts, and if a message is not received from a
particular host then that host is marked as being down. Heartbeat can
be configured to manage resources - for instance a virtual IP address -
according to this information. Heartbeat is able to monitor both other
hosts running Heartbeat and any device that can respond to ping requests.
The latter can be used to establish a simple quorum device. For instance,
if the heartbeat hosts are configured to monitor the IP address of a switch
they are connected to, or a router, then they can use this information to
decide whether they can take over a resource. It may be desirable to
configure things such that if a host is unable to access the switch on the
external network, then it will not take over any resources. That is, if
the quorum device is unavailable, then that host is not eligible to hold
any resources.
Figure 9: The Saru Daemon
Unfortunately, heartbeat's resource manager - the logic that decides when
to take over or relinquish resources based on information on the status of
monitored nodes - is somewhat limited and does not work at all for more
than two nodes. For this reason when using heartbeat for this project it
should not manage any resources. Rather, saru should receive
messages from Heartbeat about the status of nodes and make its own
decisions about what resources it owns. In a nutshell, saru
should use information from heartbeat to allocate the blocks in the result
space to the active linux directors and use this information to configure
ipt_saru locally.
Filter: Node State
The Node State is used to determine if a node is participating
in the cluster and thus participates in elections and has blocks
allocated to it.
Figure 10: Node States
| Node States | |
| S_NODE_UNKNOWN | This is an error state. A node may move itself to the S_NODE_LEAVE or S_NODE_JOIN state from this state by sending a SARU_M_NODE_STATE_NOTIFY message. |
| S_NODE_LEAVE | The node is not able to participate in the cluster. This occurs when any of the configured quorum devices are unavailable. When all the configured quorum devices are available a node should move from this state to the S_NODE_JOIN state by sending a SARU_M_NODE_STATE_NOTIFY message. |
| S_NODE_JOIN | The node is able to participate in the cluster. This occurs when all of the configured quorum devices are available. When any of the configured quorum devices becomes unavailable a node should move from this state to the S_NODE_LEAVE state by sending a SARU_M_NODE_STATE_NOTIFY message. |
| Node State Messages | |
| M_NODE_STATE_NOTIFY | Sent by any node to any other node or all nodes to advise its state or in response to a M_NODE_STATE_QUERY message. Should include the state, S_NODE_LEAVE or S_NODE_JOIN. |
| M_NODE_STATE_QUERY | Sent by master node to a slave node or all nodes in the cluster to determine their state. Nodes should reply with a M_NODE_STATE_NOTIFY message. |
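The node state rules in the table can be summarised as a small transition function. The sketch below is an interpretation, not the saru source: it assumes that at least one quorum device is configured, represents quorum status as a dictionary from device name to availability, and uses the hypothetical function name `transition`. A notification is produced only when the state changes.

```python
S_NODE_UNKNOWN = "S_NODE_UNKNOWN"
S_NODE_LEAVE = "S_NODE_LEAVE"
S_NODE_JOIN = "S_NODE_JOIN"

def transition(current, quorum_available):
    """Return (new_state, message) for the given quorum device status.

    quorum_available maps each configured quorum device to True/False.
    message is a (name, state) tuple to send, or None if nothing changed.
    """
    if all(quorum_available.values()):
        new_state = S_NODE_JOIN       # every quorum device is reachable
    else:
        new_state = S_NODE_LEAVE      # at least one quorum device is down
    if new_state != current:
        return new_state, ("M_NODE_STATE_NOTIFY", new_state)
    return new_state, None

# A node starting up with its only quorum device reachable joins the cluster.
state, msg = transition(S_NODE_UNKNOWN, {"switch": True})
# Losing one of two quorum devices forces it to leave again.
state2, msg2 = transition(state, {"switch": True, "router": False})
```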
If saru running on each linux director divides up the blocks in the results space, then it is possible that an inconsistent state may result between linux directors. This may result in some blocks being allocated to more than one linux director or some blocks being allocated to no linux director. Both of these situations are highly undesirable. For this reason it is thought that it is good to have a master node in the cluster.
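The two inconsistencies described above, blocks with no owner and blocks with more than one owner, are easy to detect once all allocations are visible in one place. The following sketch shows such a check; the representation (a dictionary from node name to a set of block numbers) and the function name `check_allocation` are assumptions made for illustration.

```python
def check_allocation(allocations, n_blocks):
    """Return (unallocated, duplicated) lists of block numbers.

    allocations maps a node name to the set of blocks it believes it owns.
    """
    unallocated, duplicated = [], []
    for block in range(n_blocks):
        owners = [node for node, blocks in allocations.items() if block in blocks]
        if not owners:
            unallocated.append(block)     # no director accepts this traffic
        elif len(owners) > 1:
            duplicated.append(block)      # several directors accept it
    return unallocated, duplicated

# A consistent view: every block has exactly one owner.
consistent = check_allocation({"fred": {0, 2}, "mary": {1, 3}}, 4)

# An inconsistent view, as might arise if each node divided the space itself.
inconsistent = check_allocation({"fred": {0, 1}, "mary": {1}}, 4)
```

In the inconsistent case blocks 2 and 3 would be dropped by every linux director and block 1 would be accepted twice, which is why a single node is given responsibility for the allocation.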
Note that saru having a master and slave nodes is only a convenience for saru internally. The linux directors are still working in an active-active configuration to load balance network traffic.
When the cluster of linux directors starts up for the first time or the current master leaves the cluster, a master needs to be elected. When any node joins the cluster it needs to find out which node is the master so that it can request blocks from it. The following states and messages are used to allow a master to be elected and queried. The messages are sent via Heartbeat.
Figure 11: Master Election States
| Master Election States | |
| S_MASTER_UNKNOWN | This is an error state. A node may move itself to the S_MASTER_NONE state from this state. |
| S_MASTER_NONE | The node does not know who the master is. This should be the state of a node if it is not a member of the cluster. That is, the node should move to this state from any other state if it is not a member of the cluster. A node may move to the S_MASTER_QUERY state from this state by sending a M_MASTER_QUERY message. |
| S_MASTER_QUERY | The node does not know who the master is, has sent a M_MASTER_QUERY message and is waiting for a reply. If the node receives a M_MASTER_NOTIFY message then it should move to the S_MASTER_SET state. After an internally defined timeout or if a M_MASTER_NOMINATION message is received, the node should send a M_MASTER_NOMINATION message and go to the S_MASTER_ELECTION state. |
| S_MASTER_ELECTION | The node does not know who the master is and has sent a M_MASTER_NOMINATION message. A node should move to this state from any other state by sending a M_MASTER_NOMINATION message if it receives a M_MASTER_NOMINATION message. On receipt of a M_MASTER_NOTIFY message, the node should tally the election results, obtained from information in any M_MASTER_NOMINATION messages received, and calculate a winner. If the master in the M_MASTER_NOTIFY message matches the winner calculated, then this winner should be announced in a M_MASTER_NOTIFY message and the node should move to the S_MASTER_SET state. Otherwise the node should send a M_MASTER_NOMINATION and reenter the S_MASTER_ELECTION state. After an internally defined timeout the node should calculate a winner, announce this winner in a M_MASTER_NOTIFY message and move to the S_MASTER_SET state. |
| S_MASTER_SET | The node knows who the master is. The node should send a M_MASTER_RETIRE if it is the master and is going to the S_MASTER_NONE state. If the node is the master, then on receipt of a M_MASTER_QUERY it should respond with a M_MASTER_NOTIFY message. Nodes may asynchronously send M_MASTER_NOTIFY messages. On receipt of a M_MASTER_NOTIFY message that does not match the current master, the node should send a M_MASTER_NOMINATION and go into the S_MASTER_ELECTION state. |
| Note: A node is regarded to have left the cluster if any of the configured quorum devices is unavailable. | |
| Master Election Messages | |
| M_MASTER_QUERY | Sent by a node to find out who the master node is. The master node should reply with a M_MASTER_NOTIFY message. |
| M_MASTER_NOTIFY | Sent by the master node to a slave node or all nodes in the cluster in response to a M_MASTER_QUERY message. Also sent to all nodes in the cluster by any node when a new master is elected. Should include the name of the master node. |
| M_MASTER_NOMINATION | Sent by a node when entering the S_MASTER_ELECTION state. Includes a metric, presumably the generation of the node, that is used to determine who wins the election. This metric should be regenerated by each node, each election. |
| M_MASTER_RETIRE | Sent by the master node if it stops being the master. |
| Note: All messages include the name of the node that they were sent from. This may be used to determine which node to reply to. | |
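The tallying step of the election can be sketched as below. The text does not define how the winner is calculated from the nomination metrics, so this sketch assumes that the highest metric wins and that ties are broken by the alphabetically first node name; both rules, and the function names, are assumptions made for illustration only.

```python
def tally(nominations):
    """Pick a winner from {node_name: metric} gathered from nominations.

    Assumed rule: highest metric wins, ties go to the alphabetically
    first node name so that every node computes the same answer.
    """
    if not nominations:
        raise ValueError("no nominations received")
    return min(nominations, key=lambda name: (-nominations[name], name))

def on_master_notify(claimed_master, nominations):
    """React to M_MASTER_NOTIFY while in S_MASTER_ELECTION.

    Returns (message_to_send, next_state).
    """
    winner = tally(nominations)
    if claimed_master == winner:
        return ("M_MASTER_NOTIFY", winner), "S_MASTER_SET"
    # Disagreement: nominate again and stay in the election.
    return ("M_MASTER_NOMINATION", None), "S_MASTER_ELECTION"

votes = {"fred": 3, "mary": 5}
agreed = on_master_notify("mary", votes)
disputed = on_master_notify("fred", votes)
```

A deterministic tie-break matters because every node tallies independently; without one, two nodes with equal metrics could each announce themselves as master.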
Heartbeat associates a node name with each node in the cluster. However, this is somewhat inefficient and by using an identity number in the range from 0 to 255 a bitmap can be used to allocate blocks to the nodes. The id of 0 is reserved to mark nodes whose identity number is unknown. Thus, the effective range of identity numbers is from 1 to 255.
The master node allocates an identity number to each node. A slave node requests its identity number from the master node when it joins the cluster. It also requests its identity number when the master node changes. To avoid the identity number of a given node in the cluster needlessly changing, all nodes should listen to all identity number notifications and remember the association between node names and identity numbers. In this way, if a slave node subsequently becomes the master node, it can use this information as a base for the identity numbers it will issue.
Figure 12: Identity Management States
| Identity Management States | |
| S_US_UNKNOWN | This is an error state. A node may move itself to the S_US_NONE state from this state. |
| S_US_NONE | The node does not know its own ID number. A node should move to this state from any other state if it is not a member of the cluster or if the master node changes. A node may move to the S_US_QUERY state from this state by sending a M_US_QUERY message. |
| S_US_QUERY | The node does not know its own ID number, has sent a M_US_QUERY message and is awaiting a reply. After an internally defined timeout the node should move to the S_US_NONE state. |
| S_US_SET | The node knows its own ID number. A node may move to this state from any other state if it receives a M_US_NOTIFY message and sets its ID number accordingly. A node may move to this state from any other state if it is the master node. It should allocate id numbers to all nodes in the cluster. |
| Identity Management Messages | |
| M_US_QUERY | Sent by a slave node to the master node to find its identity number. If this message is received by a slave node it should be ignored. |
| M_US_NOTIFY | Sent by a master node to any other node or all nodes to associate an identity number with a node name. This should be the response to a M_US_QUERY. All nodes that receive this message should update their table of node names and identity numbers accordingly. This message should include the node's id number and the name. |
| Note: All messages include the name of the node that they were sent from. This may be used to determine which node to reply to. | |
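The identity bookkeeping described above can be sketched as a small table that every node maintains. This is an illustration under assumptions: the class name `IdentityTable`, its methods, and the lowest-free-number policy are hypothetical. Only the 1 to 255 range, the reserved value 0, and the idea of remembering notified associations come from the text.

```python
class IdentityTable:
    """Node-name to identity-number table kept by every node."""

    def __init__(self):
        self.ids = {}                      # node name -> id in 1..255

    def learn(self, node_name, ident):
        """Record an association seen in a M_US_NOTIFY message."""
        if not 1 <= ident <= 255:
            raise ValueError("identity numbers range from 1 to 255")
        self.ids[node_name] = ident

    def allocate(self, node_name):
        """Master only: return the node's id, issuing a new one if needed."""
        if node_name in self.ids:
            return self.ids[node_name]     # keep existing ids stable
        used = set(self.ids.values())
        for candidate in range(1, 256):    # 0 is reserved for "unknown"
            if candidate not in used:
                self.ids[node_name] = candidate
                return candidate
        raise RuntimeError("no free identity numbers")

# The original master issues ids.
master = IdentityTable()
fred_id = master.allocate("fred")
mary_id = master.allocate("mary")

# A slave that listened to the notifications later becomes master
# and reuses the ids it remembered.
successor = IdentityTable()
successor.learn("fred", fred_id)
successor.learn("mary", mary_id)
new_node_id = successor.allocate("bob")
```

Because the successor starts from the associations it overheard, existing nodes keep their numbers and only the new node receives a fresh one.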
The master node allocates blocks to all the linux directors. Slave nodes may request blocks from the master. These requests can be made via heartbeat. The following states and messages are used to allow blocks to be allocated by the master.
Figure 13: Block Management States
| Block Management States | |
| S_BLOCK_UNKNOWN | This is an error state. A node may move itself to the S_BLOCK_NONE state from this state. |
| S_BLOCK_NONE | The node has no blocks allocated. This should be the state of a node while it is not a member of the cluster. That is, the node should move to this state from any other state if it is not a member of the cluster. A node may move to the S_BLOCK_QUERY state from this state by sending a M_BLOCK_QUERY message. |
| S_BLOCK_QUERY | The node has no blocks allocated, has sent a M_BLOCK_QUERY message and is awaiting a reply. After an internally defined timeout the node should move to the S_BLOCK_NONE state. |
| S_BLOCK_SET | The node has blocks allocated. A node may move to this state from any other state if it receives a M_BLOCK_GRANT message and sets its allocated blocks accordingly. A node may move to this state from any other state if it is the master node. It may send a M_BLOCK_QUERY message to all nodes in the cluster and wait for M_BLOCK_USED replies until an internally defined timeout. It should allocate the internal block space and send a M_BLOCK_GRANT to all nodes in the cluster. |
| Note: A node is regarded to have left the cluster if any of the configured quorum devices is unavailable. | |
| Block Management Messages | |
| M_BLOCK_GRANT | Sent by a master node to grant blocks. Includes a bitmap of the blocks being granted. May be sent to an individual slave node or all nodes in the cluster. May be sent in response to a M_BLOCK_REQUEST message or asynchronously. In either case all recipient nodes should respond with a M_BLOCK_ACCEPT message. |
| M_BLOCK_REQUEST | Sent by a slave node to the master node to request blocks. This should be done when the slave node joins the cluster. It may also be done to request additional blocks, which may or may not be granted. The master node should respond with a M_BLOCK_GRANT message. |
| M_BLOCK_ACCEPT | Sent by a slave node to the master node in response to a M_BLOCK_GRANT message. The master node may resend the M_BLOCK_GRANT message if this is not received within an internally defined timeout. |
| M_BLOCK_QUERY | Sent by any node to any other node or all nodes in the cluster to ask which block allocations that node is using. It is thought that nodes may have some blocks allocated that they are not using. These may be allocated to other nodes to allow more even load balancing. The node should respond with a M_BLOCK_USED message. |
| M_BLOCK_USED | Sent to the master node in response to a M_BLOCK_QUERY message. Includes a bitmap of the blocks being used. |
| Note: All messages include the name of the node that they were sent from. This may be used to determine which node to reply to. | |
It is reasonable to expect that users may want to have different clusters of linux directors and to have a single linux director be a member of more than one cluster. For this reason a cluster id is used to uniquely identify a cluster. An unsigned 16-bit number is used. This allows ~65,000 unique clusters.
Given that the saru daemon changes the ipt_saru filtering
rules on the fly, according to which linux directors are active in the
cluster, it is reasonable to expect that the block that a given TCP
connection belongs to may be reallocated to a different
linux director if linux directors are added to or removed from the
cluster. Thus steps should be taken to avoid these connections breaking in
such circumstances.
Connection Management: Connection Synchronisation
One approach is to use Connection Synchronisation[10].
This synchronises information about active connections
between the linux directors in the cluster, and allows a connection
to continue even if its linux director changes. This is arguably
the best solution for linux directors, as it allows the load of
existing and new connections to be distributed amongst the active
linux directors by saru.
Connection Management: Connection Tracking
However, if active-active is being used on real servers,
that is hosts that terminate a connection, then this solution is not so
attractive as most daemons that handle connections cannot deal
with the connection changing machine mid-stream.
In this situation an alternative approach that allows
established connections to continue being handled
by their existing real server is needed. This can
be achieved by using the Connection Tracking support
of netfilter to accept packets for established connections,
while packets for new connections are accepted or rejected by saru.
Unfortunately this solution can only work for TCP and
not UDP as the latter does not have sufficient information
in the protocol to distinguish between new and
established connections. In fact, strictly speaking
UDP does not have connections at all.
The following sample iptables rules can be used to explain
this technique in more depth.
Both the MAC and IP address of an interface can be set on Linux using either the ifconfig or ip command. The following configures a host's eth0 interface with the MAC address 00:50:56:14:03:40 and adds the IP address 192.168.20.40.
ip link set eth0 down
ip link set eth0 address 00:50:56:14:03:40
ip link set eth0 up
ip route add default via 172.16.0.254
ip addr add dev eth0 192.168.20.40/24 broadcast 255.255.255.0
The results can be verified by using ip addr sh eth0. For example:
2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast qlen 100
    link/ether 00:50:56:14:03:40 brd ff:ff:ff:ff:ff:ff
    inet 172.16.4.226/16 brd 172.16.255.255 scope global eth0
    inet 192.168.20.40/24 brd 255.255.255.0 scope global eth0
Note that 192.168.20.40 was added to the interface, which already had another address. This is useful, as a linux director can have a unique IP address as well as the common IP address to allow access directly to a particular linux director.
If necessary, set a bogus MAC address to be used as the source MAC address when sending frames. This example sets 00:50:56:14:03:43 as the address to be used when sending frames from eth0.
echo 1 > /proc/sys/net/ipv4/conf/all/outgoing_mac
echo 00:50:56:14:03:43 > /proc/sys/net/ipv4/conf/eth0/outgoing_mac
Iptables uses a module libipt_saru to add rules such that packets will be vetted based on the packet match made by the ipt_saru netfilter module. The following iptables rules should configure a virtual IP address, 192.168.20.40, to only accept packets as dictated by ipt_saru and in turn the saru daemon. --id 1 indicates that the saru daemon monitoring cluster with id 1 should be used to manipulate the filter on the fly.
iptables -F
iptables -A INPUT -d 192.168.20.40 -p tcp \
-m saru --id 1 --sense src-port -j ACCEPT
iptables -A INPUT -d 192.168.20.40 -p udp \
-m saru --id 1 --sense src-port -j ACCEPT
iptables -A INPUT -d 192.168.20.40 -p icmp \
-m icmp --icmp-type echo-request \
-m saru --id 1 --sense src-addr -j ACCEPT
iptables -A INPUT -d 192.168.20.40 -p icmp \
-m icmp --icmp-type ! echo-request -j ACCEPT
iptables -A INPUT -d 192.168.20.40 -j DROP
If LVS-NAT is being used then the following rules are also required to prevent all the linux directors sending replies on behalf of the real servers. The example assumes that the real servers are on the 192.168.6.0/24 network.
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -d 192.168.6.0/24 -j ACCEPT
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -m state --state INVALID \
-j DROP
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -m state --state ESTABLISHED \
-j ACCEPT
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -m state --state RELATED \
-j ACCEPT
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -p tcp -m state --state NEW \
--tcp-flags SYN,ACK,FIN,RST SYN -m saru --id 1 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -p udp -m state --state NEW \
-m saru --id 1 -j MASQUERADE
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -p icmp \
-m icmp --icmp-type echo-request -m state --state NEW \
-m saru --id 1 --sense dst-addr -j MASQUERADE
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -p icmp \
-m icmp --icmp-type ! echo-request -m state --state NEW \
-j MASQUERADE
iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -j DROP
Given that the saru daemon changes the ipt_saru filtering rules on the fly, according to which linux directors are active in the cluster, it is reasonable to expect that the block that the source port of a given TCP connection belongs to may be reallocated to a different linux director during the life of the connection. For this reason, it is important that TCP connections be synchronised from each active linux director to all other active linux directors using Connection Synchronisation[10].
Heartbeat should be configured with no local resources and at least one ping node which will act as a quorum device. That is, the haresources file should be empty and the ha.cf file should include at least one ping directive, a node directive for each linux director and at least one transport. A minimal example ha.cf, which uses multicast for communication and has the nodes fred and mary in the cluster, follows.
mcast eth0 225.0.0.7 694 1 1
ping 192.168.0.254
node fred
node mary
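The requirements on ha.cf described above can be checked mechanically. The following is a minimal sketch rather than part of saru or Heartbeat; the function name check_ha_cf is hypothetical, and the set of transport directives (serial, bcast, mcast, ucast) is an assumption that may not match every Heartbeat release.

```python
# Transport directives assumed for this sketch.
TRANSPORTS = {"serial", "bcast", "mcast", "ucast"}


def check_ha_cf(text):
    """Return a list of problems found in the contents of an ha.cf file."""
    directives = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            directives.append(line.split()[0])

    problems = []
    if "ping" not in directives:
        problems.append("no ping directive")
    if "node" not in directives:
        problems.append("no node directive")
    if not TRANSPORTS & set(directives):
        problems.append("no transport directive")
    return problems


EXAMPLE = """\
mcast eth0 225.0.0.7 694 1 1
ping 192.168.0.254
node fred
node mary
"""

print(check_ha_cf(EXAMPLE))
```

The check only confirms that each kind of directive is present. It does not verify that there is a node directive for every linux director, which depends on knowledge of the cluster.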
The saru daemon may be started manually, at boot time using its init script, or by adding the following line to the ha.cf file.
respawn root /path/to/saru
By default saru waits for a short time after starting before initialising its filter. This is to allow connection synchronisation information time to propagate. This and other options can be configured on the command line or by editing the saru.conf file. See the saru man page for details.
Once saru is up and running, the state of the filtering rules can be inspected by examining /proc/net/saru_map. Below is an example showing that the filter rule for cluster id 1 has been set with a bitmap of all '5's. Please note that the bitmap has been truncated for formatting reasons.
Id   RefCount BitMap
0001 00000001 55555555555555...
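Output in this form is straightforward to inspect from a script. The sketch below assumes that the BitMap column is hexadecimal and that each set bit marks a block accepted by the local linux director; neither the bit ordering nor the full bitmap length is assumed. The sample string is a shortened, made-up bitmap, not real output, and the function names are hypothetical.

```python
def parse_saru_map(text):
    """Return one dict per data row, skipping the header line."""
    rows = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) == 3:
            rows.append(dict(zip(("id", "refcount", "bitmap"), fields)))
    return rows


def accepted_fraction(bitmap_hex):
    """Fraction of bits set in a hexadecimal bitmap string."""
    bits = 4 * len(bitmap_hex)
    return bin(int(bitmap_hex, 16)).count("1") / bits


# Shortened sample for illustration only; on a live system the text
# would be read from /proc/net/saru_map instead.
SAMPLE = "Id   RefCount BitMap\n0001 00000001 55555555\n"

rows = parse_saru_map(SAMPLE)
fraction = accepted_fraction(rows[0]["bitmap"])
print(rows[0]["id"], fraction)
```

Each hexadecimal 5 is the bit pattern 0101, so a bitmap of all '5's has half of its bits set. That is what one would expect to see on each node of a two node cluster in which the blocks are shared evenly.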
As discussed in Connection Management, either Connection Synchronisation or Connection Tracking should be used in conjunction with saru to avoid connections breaking unnecessarily when linux directors are added or removed from the cluster.
Connections can be distributed between multiple active linux directors by assigning all of the linux directors a common MAC and IP address. This is a relatively straightforward process. This should work on both switched and non-switched ethernet networks. Behaviour in non-ethernet environments is beyond the scope of this project.
In order to prevent duplicate packets - which would probably result in TCP connection failure - it is important that only one of the active linux directors accepts the packets for a given TCP connection. By utilising the Netfilter packet filtering infrastructure in the Linux 2.4 kernel and the monitoring capabilities of Heartbeat it is possible to build a system that dynamically updates what traffic is accepted by each linux director in a cluster. In this way it is possible to balance incoming traffic between multiple active linux directors.
This solution should scale to at least 16 nodes. Both heartbeat and the connection synchronisation daemon use multicast for communication, so the amount of communication between linux directors should increase linearly as the number of nodes increases. The real limitation is thought to be either switch bandwidth or netfilter's handling of packets, most of which will be dropped by any given node.
The design allows existing connections, and the resources needed to accept new connections, to be migrated quickly in the case of fail-over. I do not believe the same can be said for DNS based solutions.
I believe that this design provides the best possible active-active solution for linux directors. Interestingly, this design could be used to load balance any linux machines and could, if appropriate, be used in place of Layer 4 Switching.
[1] Wensong Zhang et al. 1998-2003. "Linux Virtual Server Project".
http://www.linuxvirtualserver.org/.
[2] F5 Networks, Inc. 1999-2003.
"F5 Networks".
http://www.f5.com/.
[3] F5 Networks. 2003. "F5 Networks".
http://www.f5networks.co.jp/.
[4] Foundry Networks, Inc. 2003.
"Foundry Networks".
http://www.foundrynetworks.com/.
[5] Foundry Networks. 2003.
"Foundry Networks".
http://www.foundrynetworks.co.jp/.
[6] Simon Horman. 2000.
"Creating Linux Web Farms".
http://www.vergenet.net/linux/.
[7] OSDN. 1997-2003.
"Slashdot.Org".
http://slashdot.org/.
[8] Internet Initiative Japan Inc. 1996.
"Internet Initiative Japan (IIJ)".
http://www.iij.ad.jp/.
[9] Dr. Ing. h.c. F. Porsche AG. 2003.
"Porsche.Com".
http://www.porsche.com/.
[10] Simon Horman. 2002.
"LVS Tutorial".
http://www.ultramonkey.org/.
[11] Bert Hubert. 2001.
"How to do simple loadbalancing with Linux without a single point of
failure".
http://lartc.org/autoloadbalance.php3.
[12] IEEE. 2000.
IEEE Std 802.3,
"CSMA/CD Access Method and Physical Layer Specification",
2000 Edition. pp 39.
http://standards.ieee.org/
[13] David C. Plummer. November 1982.
RFC 826 "An Ethernet Address Resolution Protocol"
or "Converting Network Protocol Addresses to 48.bit Ethernet
Addresses for Transmission on Ethernet Hardware".
http://www.ietf.org/rfc/rfc826.txt
[14] Internet Engineering Task Force. Editor: R. Braden
RFC 1122 "Requirements for Internet Hosts -- Communication Layers"
http://www.ietf.org/rfc/rfc1122.txt
[15] Netfilter Core Team. 2003.
"Netfilter - firewalling, NAT and packet mangling for Linux 2.4",
http://www.netfilter.org/
[16] W. Richard Stevens. 1998.
"Unix Network Programming", Volume 1, Second Edition.
pp 42-43.
Prentice Hall, Upper Saddle River, NJ, USA.
[17] Alan Robertson et al. 1999-2003.
"Heartbeat",
http://www.linux-ha.org/heartbeat/.