Linux Bridges, care and feeding.

The other day we hit an interesting "feature" in the Linux bridge implementation.

On one of our KVM hosts, starting or stopping a virtual machine would take the host off the network by changing the host's MAC address. Service was restored only by restarting the virtual machine, which would change the MAC back, or by waiting for ARP caches to expire and refresh (which takes 2 hours on our firewall).

IP address conflict? No.

It was down to how the Linux bridge code picks its MAC address, and how we configure our systems.

Why do we do it this way?

We have a moderately complex network environment at work: bridges, VLANs and bonding all at once.

We have had problems in the past with, for example, the Red Hat startup scripts not being able to assemble things in the right order (which is another conversation).

The standard way of using Linux bridges with KVM virtualisation has been inherited from the way Xen did it, which was to rename the physical device and swap MAC addresses around.

i.e. if you have eth0 (MAC address A) and br0 you end up with

peth0 (FE:FF:..) -> br0 -> eth0 (A)

Where eth0 is a VETH device

The bridge looks like this

# brctl show
bridge name    bridge id            STP enabled    interfaces
br0            8000.0e26f1d2a5b7    no             peth0
                                                   vif0.0

Where the vif0.0 is the other end of veth0, which gets renamed to eth0.

That's fine, but what if you have bond0 rather than eth0? The problem is that bond0 takes its MAC address from the Ethernet cards that are its slaves, and would ignore any MAC address you tried to set on it using something like "ip link set bond0 address XX:XX:XX:XX:XX:XX".

The result was that you'd end up with bond0's MAC address both on "pbond0" (the actual bond device) and on the VETH device "bond0" that the network-bridge scripts created. Same MAC on two sides of a bridge; cue confusion in the bridge's pointy head.

You'll get lots of these messages

 received packet with own address as source address

in your /var/log/messages.
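
To get a rough idea of how often this is happening, you can count the messages. This is only a minimal sketch: count_own_address_warnings is a made-up helper name, the pattern is deliberately loose on the assumption that the exact wording differs between kernel versions, and the log path assumes kernel messages end up in /var/log/messages as they do on our hosts.

```shell
# Hypothetical helper: counts bridge "own address" warnings read from stdin.
count_own_address_warnings() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'received packet.*own address as source address'
}

# Typical use on the host (needs read access to the log):
#   count_own_address_warnings < /var/log/messages
```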

So, in short, we dumped the VETH and went to the even older method of just assigning an IP address to the bridge, which worked very nicely until we got the new machines.

And the problem is?

So, our setup looks like this.

eth8 -\
        bond101 -> br101 
eth9 -/

Where bond101 and br101 look like this;

bond101   Link encap:Ethernet  HWaddr 00:2B:2B:16:7A:DA  
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
   
br101     Link encap:Ethernet  HWaddr 00:2B:2B:16:7A:DA  
          inet addr:10.82.101.206  Bcast:10.82.101.255  Mask:255.255.255.0

The host can talk to the network and the VMs can be added, removed, migrated etc.

The problem comes from the MAC address on the bridge - in the example above;

HWaddr 00:2B:2B:16:7A:DA

Now, the Linux bridge takes the lowest-numbered MAC address of all its connected interfaces as its own. A bridge with no connected interfaces looks like this;

bond0     Link encap:Ethernet  HWaddr 00:00:00:00:00:00  
          BROADCAST MASTER MULTICAST  MTU:1500  Metric:1

If we add an interface with MAC address 02:1B:21:6F:8D:00, then that will become the MAC address of the bridge.

If we then add an interface with MAC address 00:1B:21:6F:8D:00, that becomes the MAC address of the bridge instead, and the MAC address behind any IP addresses configured on that bridge changes with it (because 00: is less than 02:).
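
The selection rule can be mimicked in a few lines of shell. This is a sketch of the rule as described above, not the kernel's code; lowest_mac is a made-up helper name, and it assumes MACs are written in the usual colon-separated, two-hex-digits-per-octet form.

```shell
# Hypothetical helper: prints the lowest of the MAC addresses given as arguments.
lowest_mac() {
    # Lower-case everything so the comparison is case-insensitive, then sort
    # byte-wise. With fixed-width hex octets, byte order equals numeric order.
    printf '%s\n' "$@" | tr 'A-F' 'a-f' | LC_ALL=C sort | head -n 1
}

# The two example interfaces from the text: the 00: address wins.
lowest_mac 02:1B:21:6F:8D:00 00:1B:21:6F:8D:00   # prints 00:1b:21:6f:8d:00
```

Feeding it every port you are about to attach tells you in advance which address the bridge will end up with.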

Those new machines

Getting back to those troublesome new machines: they differ in that they have a Broadcom chipset, not Intel, so their MAC addresses look like 84:2b:2b:16:7a:f9 (Broadcom), as opposed to 00:1B:21:6F:8D:00 (Intel).

So, the bridge takes the MAC address 84:2b:2b:16:7a:f9. As you add virtual machines, there is now roughly a 50:50 chance that the randomly generated MAC address of each vif will be lower than that of the primary physical interface, as 84: is just over halfway from 00: to FF: and the MAC addresses for these Broadcom chips all start 84:.
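
The 50:50 figure is easy to check with a bit of shell arithmetic. This sketch assumes the first octet of a generated MAC is uniformly random over 00 to FF, which is a simplification, and it ignores ties on the first octet; chance_below is a made-up helper name.

```shell
# Hypothetical helper: chance that a uniformly random first octet is lower
# than the given one. $1 is the octet in hex without the colon, e.g. 84.
chance_below() {
    below=$(( 0x$1 ))                  # how many octet values are lower than $1
    tenths=$(( below * 1000 / 256 ))   # percentage in tenths, rounded down
    printf '%d.%d%%\n' $(( tenths / 10 )) $(( tenths % 10 ))
}

chance_below 84   # Broadcom prefix: prints 51.5%
chance_below 00   # Intel prefix:    prints 0.0%
```

Which is why the Intel boxes never showed the problem: nothing can sort below 00:.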

If we add some virtual machines, they'll have host-side interfaces with randomly generated MACs like this;

18: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 500
    link/ether 82:f4:38:43:e7:2e brd ff:ff:ff:ff:ff:ff 
19: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 500
    link/ether 56:d4:1d:d3:1a:37 brd ff:ff:ff:ff:ff:ff

In this case, as soon as you start up the machine with vnet0 as its host-side interface, the MAC address on the bridge will change, because 82: is lower than 84:.

Let's watch it.

# brctl show
bridge name    bridge id            STP enabled    interfaces
br101          8000.842b2b167af9    no             bond101

Starting State
# ip link show br101
36: br101: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue 
    link/ether 84:2b:2b:16:7a:f9 brd ff:ff:ff:ff:ff:ff
# ip link show bond101
13: bond101: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue 
    link/ether 84:2b:2b:16:7a:f9 brd ff:ff:ff:ff:ff:ff

Add vnet0
# ip link show vnet0
18: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 500
    link/ether 82:f4:38:43:e7:2e brd ff:ff:ff:ff:ff:ff
# brctl addif br101 vnet0
# ip link show br101
36: br101: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue 
    link/ether 82:f4:38:43:e7:2e brd ff:ff:ff:ff:ff:ff

Add vnet1
# ip link show vnet1
19: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast qlen 500
    link/ether 56:d4:1d:d3:1a:37 brd ff:ff:ff:ff:ff:ff
# brctl addif br101 vnet1
# ip link show br101
36: br101: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue 
    link/ether 56:d4:1d:d3:1a:37 brd ff:ff:ff:ff:ff:ff

# ifconfig br101
br101     Link encap:Ethernet  HWaddr 56:D4:1D:D3:1A:37  
          inet addr:10.82.101.206  Bcast:10.82.101.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

If we remove vnet1, then in the immortal words of the Haynes manuals (cars, not underwear) removal is the reverse of fitting.

So we end up with the MAC address of the box bouncing all over the place and, from the point of view of that firewall with the long ARP timeout, the box falling off the network.
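
If you want to catch this on a host before the firewall does, one option is to compare the bridge's MAC with that of its physical uplink. A rough sketch, not something we run in production: check_bridge_mac and the SYSFS_NET override are made up for illustration (the override only exists so the function can be tried without real devices), and it assumes the kernel exposes each interface's MAC in /sys/class/net/<device>/address.

```shell
# Hypothetical helper: warns when a bridge's MAC differs from its uplink's.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}

check_bridge_mac() {
    bridge=$1
    uplink=$2
    # Read both addresses from sysfs; give up with status 2 if either is missing.
    br_mac=$(cat "$SYSFS_NET/$bridge/address") || return 2
    up_mac=$(cat "$SYSFS_NET/$uplink/address") || return 2
    if [ "$br_mac" = "$up_mac" ]; then
        echo "OK $bridge $br_mac"
    else
        echo "DRIFT $bridge has $br_mac, expected $up_mac from $uplink"
        return 1
    fi
}

# On the host in this post you would call:
#   check_bridge_mac br101 bond101
```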

Fixes;

None of these are really appealing.

1. We could go back to using a VETH device for the host, using the bond MAC address, and stop putting an IP address on the bridge. We would then need to assign a random MAC to the bond itself (which now seems to work).
2. We could create a dummy interface, give it a low-numbered MAC address (e.g. starting 00:00), and connect that to the bridge.
3. We could set the MAC address of the primary interface to start 00: so that it is always the lowest-numbered MAC address.

We're trying 3 as it's the least disruptive. 1 is probably the "proper" way to do it, however.
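
For the record, the dummy-interface and low-MAC options might look something like the sketch below. Treat it as an illustration under assumptions rather than what we actually deployed: the function names, the dummy0 device name and the 00:00:5e:00:53:01 address are placeholders, and by default the script only prints the commands instead of running them.

```shell
# Print commands by default; set DRY_RUN=0 and run as root to apply them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Dummy-interface option: attach a dummy port with a low MAC to the bridge.
pin_bridge_mac_with_dummy() {
    bridge=$1
    dummy=$2
    mac=$3
    run ip link add "$dummy" type dummy
    run ip link set dev "$dummy" address "$mac"
    run ip link set dev "$dummy" up
    run brctl addif "$bridge" "$dummy"
}

# Low-MAC option: give the primary interface itself a MAC starting 00:.
set_low_mac() {
    run ip link set dev "$1" address "$2"
}

# Placeholder values; pick an address that is unique on your own network.
pin_bridge_mac_with_dummy br101 dummy0 00:00:5e:00:53:01
```

Either way the idea is the same: make sure something on the bridge always sorts below any vnet address that might turn up.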

In the long run I'm hoping that macvlans will save us from the bridge code. There are other reasons for us to dump the bridge method of connecting virtual machines to the network, mainly around how multicast works on the Linux software bridge (or, more to the point, doesn't, because the bridge doesn't do IGMP).

Written by atp

Tuesday 17 May 2011 at 11:24 am

Posted in Linux
