20 May 2019 • Written by Arnaud Bawol
As you know, we designed a multihomed peering point to secure our nines and ensure a reliable service for our customers.
We previously saw that it's possible to do some ECMP load sharing with BGP. There were several caveats to that design choice, though.
The following is a run-down of the modifications we made to get the widest possible bandwidth out of a reliable design.
Load-balancing!
The basic principle of the previously described architecture was to abstract transport between two hosting providers' networks. We relied on BGP to route packets over this peering link, and it seemed like a good idea, but BGP isn't aware of the other routers' load at all and doesn't implement any load-balancing logic.
Quite often, our traffic distribution looked like this:
As you can see, load-sharing is not load-balancing.
BGP picks a path for its packets and sticks to it forever. It can handle a network failure on either side, but when it comes to network performance, you are going to feel left behind.
Let's talk about solutions.
Since we already had a 4-hop interconnection, it didn't hurt to shake up its components a bit. We first chose pfSense for its simplicity, but mostly for the impressive performance of Packet Filter (pf).
Despite pfSense's ability to be highly available and highly reliable, it lacks a key feature for the design we had in mind. It can handle load balancing on an ingress link, but egress is another story: as it turned out, round-robin on an uplink is simply not possible with pfSense.
Since we were already using a BSD-based distribution, why not look at plain FreeBSD? It also offers CARP redundancy, ARP proxying and pf, and the absence of a UI makes all the non-essential features that pfSense leaves aside available to us.
For those who already know which feature I'm talking about, here is the configuration we used:
set limit { states 1601000, frags 20000, src-nodes 1601000, table-entries 400000 }

lan_net = "private_range_from_our_network/8"
interlan_net = "foo.bar.baz.0/24"
remote_net = "sub_private_range_from_our_network/24"

batch = "our.private.public.address/32"

int_if = "igb1"
ext_if = "igb0"

vpn_1 = "foo.bar.baz.2"
vpn_2 = "foo.bar.baz.3"
vpn_3 = "foo.bar.baz.4"
vpn_4 = "foo.bar.baz.5"

gateways = "$vpn_1, $vpn_2, $vpn_3, $vpn_4"

set block-policy drop
set loginterface egress
set skip on lo0

pass in on $int_if from any to any
pass out on $int_if from any to any
pass out on $int_if from $lan_net
pass out on $int_if from $interlan_net
pass in on $int_if from $remote_net

block in on $ext_if from { any, ! $batch }

pass in quick on $int_if route-to \
    { ($int_if $vpn_1), ($int_if $vpn_2), ($int_if $vpn_3), ($int_if $vpn_4) } \
    round-robin from $lan_net to $remote_net keep state

pass inet proto icmp icmp-type echoreq
pass out on $ext_if proto { tcp, udp, icmp }
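Not shown above: before anything goes live, the ruleset can be syntax-checked and then loaded with the usual pfctl tooling (default paths assumed):

# Parse the ruleset without loading it
pfctl -nf /etc/pf.conf

# Load it, then inspect the resulting rules and states
pfctl -f /etc/pf.conf
pfctl -s rules
pfctl -s states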
As you might guess, the most important rule here is:
pass in quick on $int_if route-to \
    { ($int_if $vpn_1), ($int_if $vpn_2), ($int_if $vpn_3), ($int_if $vpn_4) } \
    round-robin from $lan_net to $remote_net keep state
Since some BSD distributions ship their own pf port, keep in mind that this is the FreeBSD syntax; AFAIK the feature also exists in OpenBSD.
This way, pf load-balances per TCP session: once your connection is established, it keeps the same transport route until it ends.
ARP
ARP proxying is quite simple: you just have to follow FreeBSD's guidelines to add your custom ARP entries to the table. We will see a bit later how to update the ARP table when a router becomes unavailable.
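As an illustration (the address, MAC and file path below are placeholders, not our actual entries), published entries can be added with arp(8), one by one or from a file:

# Publish a proxy ARP entry: this router will answer ARP requests for that address
arp -s sub.private.range.10 00:11:22:aa:bb:cc pub

# Or load several published entries at once from a file ("address ether_addr pub" per line)
arp -f /usr/local/etc/proxy_arp_entries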
This is where it gets interesting. Now that we have a router with load-balancing capabilities, we don't want to see it fail. Using CARP, it's relatively straightforward:
/etc/rc.conf:
ifconfig_igb1="inet private_network netmask your_mask"
ifconfig_igb1_alias1="inet interlan_network_primary_address/netmask"
ifconfig_igb1_alias2="inet vhid an_id_to_apply pass mysupersecretpassword alias interlan_network_primary_address/netmask"
You can check FreeBSD's documentation if this feature is still unclear to you. With this configuration, the routers share a virtual IP address: if one of them is seen as down, it is discarded from the cluster and its peer handles the rest.
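As a quick sanity check (vhid 1 is used as an example here), the CARP state of the interface shows up directly in ifconfig's output, something like:

$ ifconfig igb1 | grep carp
        carp: MASTER vhid 1 advbase 1 advskew 0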
The same configuration being applied on the backup router, you may want to look at the advskew option to keep your failover consistent, i.e. to make sure the intended router wins the master election.
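For instance, the backup router could reuse the exact same CARP line with a higher advskew, so it only takes over when the master stops advertising (the value 100 is purely illustrative):

# backup router: same vhid and password, but a higher advskew
ifconfig_igb1_alias2="inet vhid an_id_to_apply advskew 100 pass mysupersecretpassword alias interlan_network_primary_address/netmask"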
There is now a routing failover mechanism, but we also need to load/unload our ARP entries from the table to enable/disable proxying.
FreeBSD has a device state change daemon called devd. This daemon enables you to run commands on device state changes. It's configured via a simple config file in /etc/devd/ifconfig.conf:
notify 10 {
    match "system" "CARP";
    match "subsystem" "your_vhid_id@igb1";
    match "type" "BACKUP";
    action "/usr/local/scripts/unload.sh $subsystem $type";
};

notify 10 {
    match "system" "CARP";
    match "subsystem" "your_vhid_id@igb1";
    match "type" "MASTER";
    action "/usr/local/scripts/load.sh $subsystem $type";
};
Those action scripts are called on BACKUP and MASTER events; there are a few other event types, which are well documented.
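We don't reproduce our exact scripts here, but a minimal sketch of what load.sh and unload.sh could look like, assuming the published entries are kept in a flat file suitable for arp -f (the path is hypothetical), would be:

#!/bin/sh
# load.sh -- sketch: publish the proxy ARP entries when this router becomes MASTER
# devd passes $subsystem and $type as arguments ($1 = vhid@interface, $2 = event type)
logger "carp transition on $1: $2, loading proxy ARP entries"
arp -f /usr/local/etc/proxy_arp_entries    # one "address ether_addr pub" entry per line

#!/bin/sh
# unload.sh -- sketch: withdraw the same entries when this router falls back to BACKUP
logger "carp transition on $1: $2, unloading proxy ARP entries"
awk '{ print $1 }' /usr/local/etc/proxy_arp_entries | xargs -n1 arp -d

Once the scripts and the config file are in place, devd has to be restarted (service devd restart) to pick them up.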
Since we are able to "hot swap" our routers, we also need to synchronize states. We previously saw that keep state was used in our round-robin pf rule. Those states are kept locally, but we can synchronize them to the backup router via pfsync, which would deserve a whole section of its own.
Well, it looks like this:
We kept using BGP between our VPN servers, as previously described, to ensure failover on a VPN link. Now our routers are aware of their uplinks' availability, our load balancers spread packets around, and TCP handles sequence re-ordering within sessions.
Yes, it is.
So far, we've peaked at 800 Mbps+ for several hours without breaking a sweat. Latency is not impacted by the additional hop, since the RTT mostly depends on the distance between our two DCs. Our uplinks are properly used, there is no more ARP flapping between hosts, and latency between our datacenters is rather stable.