
Layer 3 High Availability

L3 Agent Low Availability

Today, you can utilize multiple network nodes to achieve load sharing, but not high availability or redundancy. Assuming three network nodes, newly created routers will be scheduled and distributed amongst those three nodes. However, if a node drops, all routers on that node go down with it, along with any traffic they normally forward. Neutron, in the Icehouse release, doesn’t ship any built-in solution.

A Detour to the DHCP Agent

DHCP agents are a different beast altogether – The DHCP protocol allows for the co-existence of multiple DHCP servers all serving the same pool, at the same time.

By changing:

neutron.conf:
dhcp_agents_per_network = X

You will change the DHCP scheduler to schedule X DHCP agents per network. So, for a deployment with 3 network nodes, and setting dhcp_agents_per_network to 2, every Neutron network will be served by 2 DHCP agents out of 3. How does this work?

[Figure: DHCP HA topology]

First, let’s take a look at the story from a baremetal perspective, outside of the cloud world. When a workstation connected to the 10.0.0.0/24 subnet boots, it broadcasts a DHCP discover. Both DHCP servers dnsmasq1 and dnsmasq2 (Or other implementations of a DHCP server) receive the broadcast and respond with an offer for 10.0.0.2. Assuming that the first server’s response was received by the workstation first, it will then broadcast a request for 10.0.0.2, and specify the IP address of dnsmasq1 – 10.0.0.253. Both servers receive the broadcast, but only dnsmasq1 responds with an ACK. Since all DHCP communication is via broadcasts, server 2 also receives the ACK, and can mark 10.0.0.2 as taken by AA:BB:CC:11:22:33, so as to not offer it to other workstations. To summarize, all communication between clients and servers is done via broadcasts, and thus the state (What IPs are in use at any given time, and by whom) can be distributed across the servers correctly.

In the Neutron case, the MAC-to-IP assignment is configured on each dnsmasq server beforehand, when the Neutron port is created. Thus, both dnsmasq servers’ leases files will hold the AA:BB:CC:11:22:33 to 10.0.0.2 mapping before the DHCP request is even broadcast. As you can see, DHCP HA is supported at the protocol level.
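
You can watch the scheduler do its job via the API. Here’s a minimal sketch using python-neutronclient – the credentials, auth URL and network name are placeholders for your own deployment:

from neutronclient.v2_0 import client

# Hypothetical admin credentials - substitute your own
neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://controller:5000/v2.0')

net = neutron.list_networks(name='private')['networks'][0]

# With dhcp_agents_per_network = 2, this should print two agents
for agent in neutron.list_dhcp_agent_hosting_networks(net['id'])['agents']:
    print(agent['host'], agent['alive'])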

Back to the Lowly Available L3 Agent

L3 agents don’t (Currently) have any of these fancy tricks that DHCP offers, and yet the people demand high availability. So what are the people doing?

  • Pacemaker / Corosync – Use external clustering technologies to specify a standby network node for an active one. The standby node will essentially sit there looking pretty, and when a failure is detected with the active node, the L3 agent will be started on the standby node. The two nodes are configured with the same hostname so that when the secondary agent goes live and synchronizes with the server, it identifies itself with the same ID and thus manages the same routers.
  • Another solution is a script that runs as a cron job: a Python SDK script that uses the API to get a list of dead agents, fetch the routers on each one, and reschedule them to other agents (a minimal sketch follows this list).
  • In the Juno time frame, look for this patch https://review.openstack.org/#/c/110893/ by Kevin Benton to bake rescheduling into Neutron itself.
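
Here’s a minimal sketch of the cron job approach using python-neutronclient. It’s illustrative only – no locking, error handling or smart placement logic, and the credentials are placeholders:

from neutronclient.v2_0 import client

neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://controller:5000/v2.0')

# Partition the L3 agents into dead and live ones
agents = neutron.list_agents(agent_type='L3 agent')['agents']
dead = [a for a in agents if not a['alive']]
live = [a for a in agents if a['alive']]

# Move every router off a dead agent onto a live one, round-robin
for agent in dead:
    for i, router in enumerate(
            neutron.list_routers_on_l3_agent(agent['id'])['routers']):
        neutron.remove_router_from_l3_agent(agent['id'], router['id'])
        target = live[i % len(live)]
        neutron.add_router_to_l3_agent(target['id'],
                                       {'router_id': router['id']})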

Rescheduling Routers Takes a Long, Long Time

All solutions listed suffer from substantial failover times, if only for the simple fact that configuring a non-trivial number of routers on the new node(s) takes quite a while. Thousands of routers take hours to finish the rescheduling and configuration process. The people demand fast failover!

Distributed Virtual Router

Multiple documents explain how DVR works in detail.

The gist is that it moves routing to the compute nodes, rendering the L3 agent on the network nodes pointless. Or does it?

  • DVR handles only floating IPs, leaving SNAT to the L3 agents on the network nodes
  • Doesn’t work with VLANs, only works with tunnels and L2pop enabled
  • Requires external connectivity on every compute node
  • Generally speaking, is a significant departure from Havana- or Icehouse-based Neutron clouds, while L3 HA is a simpler change to make to your deployment

Ideally you would use DVR together with L3 HA. Floating IP traffic would be routed directly by your compute nodes, while SNAT traffic would go through the HA L3 agents on the network nodes.

Layer 3 High Availability

The Juno-targeted L3 HA solution uses the popular Linux keepalived tool, which uses VRRP internally. First, then, let’s discuss VRRP.

What is VRRP, how does it work in the physical world?

Virtual Router Redundancy Protocol is a first hop redundancy protocol – It aims to provide high availability of the network’s default gateway, or the next hop of a route. What problem does it solve? In a network topology with two routers providing internet connectivity, you could configure half of the network’s hosts with the first router’s IP address as their default gateway, and the other half with the second router’s.

[Figure: Router HA topology, before VRRP]

This would provide load sharing, but what happens if one router loses connectivity? Herein comes the idea of a virtual IP address, or a floating address, which will be configured as the network’s default gateway. The active router configures the virtual IP address (Or VIP for short) on its internal, LAN facing interface, and responds to ARP requests with a virtual MAC address. During a failover, the standby routers won’t receive VRRP hello messages from the master and will thus perform an election process, with the winning router acting as the active gateway and the others remaining as standby. The network computers already have entries in their ARP caches (For the VIP + virtual MAC address) and have no reason to resend an ARP request. Following the election process, the victorious standby router becomes the new active instance, and sends a gratuitous ARP request – Proclaiming to the network that the VIP + MAC pair now belongs to it. The switches comprising the network move the virtual MAC address from the old port to the new.

[Figure: The switch moves the virtual MAC address to the new active router’s port]

By doing so, traffic to the default gateway will reach the correct (New) active router. Note that this approach does not accomplish load sharing, in the sense that all traffic is forwarded through the active router. (Note that in the Neutron use case, load sharing is not accomplished at the individual router level, but at the node level, assuming a non-trivial number of routers). How does one accomplish load sharing at the individual router level? VRRP groups: The VRRP header includes a Virtual Router Identifier, or VRID. Two VRRP groups are configured, each with its own VIP and its own active router; half of the network hosts configure the first VIP as their default gateway, and the other half the second. In the case of a failure, the VIP previously found on the failing router will transfer to another one.

[Figure: Router HA with two VRRP groups]

The observant reader will have identified a problem – What if the active router loses connectivity to the internet? Will it remain as the active router, unable to route packets? VRRP adds the capability to monitor the external link and relinquish its role as the active router in case of a failure.

[Figure: VRRP external link tracking]

Note: As far as IP addressing goes, it’s possible to operate in two modes:

  1. Each router gets an IP address, regardless of its VRRP state. The master router is configured with the VIP as an additional, or secondary, address.
  2. Only the VIP is configured. IE: The master router will hold the VIP while the slaves will have no IPs configured whatsoever.

VRRP – The Dry Facts

  • Encapsulated directly in the IP protocol (protocol number 112)
  • Active instance uses multicast address 224.0.0.18, MAC 01-00-5E-00-00-12 when sending hello messages to its standby routers
  • The virtual MAC address is of the form 00-00-5E-00-01-{VRID}, thus only 255 different VRIDs (1 through 255) can exist in a single broadcast domain
  • The election process uses a user configurable priority, from 1 to 255, the higher the better
  • Preemptive elections, like in other network protocols, mean that if a standby is configured with a higher priority, or comes back up after losing its connectivity (having previously acted as the active instance), it will resume its role as the active router
  • Non-preemptive elections mean that when an active router loses its connectivity and comes back up, it will remain in a standby role
  • The hello interval is configurable (Say: Every T seconds), and standby routers perform an election process if they haven’t received a hello message from the master after 3T seconds (plus a small priority-based skew – a worked example follows this list)
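
To put a number on that timeout: VRRP (RFC 3768) adds a priority-based skew to the 3T figure, so higher-priority standbys time out, and thus take over, first. Here’s the arithmetic in a few lines of Python, plugged with the advert_int and priority values from the keepalived.conf shown later in this post:

def master_down_interval(advert_int, priority):
    # RFC 3768: a standby declares the master dead after
    # 3 * advertisement_interval + skew_time seconds, where the
    # skew gives higher-priority routers a head start
    skew_time = (256 - priority) / 256.0
    return 3 * advert_int + skew_time

# advert_int = 2, priority = 50, as in the keepalived.conf below
print(master_down_interval(2, 50))  # 6.8046875 -> failover detected in ~7s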

Back to Neutron-land

L3 HA starts a keepalived instance in every router namespace. The different router instances talk to one another via a dedicated HA network, one per tenant. This network is created under the blank tenant to hide it from the CLI and GUI. The HA network is a Neutron tenant network, same as every other network, and uses the default segmentation technology. HA routers have an ‘HA’ device in their namespace: When an HA router is created, it is scheduled to a number of network nodes, along with a port per network node, belonging to the tenant’s HA network. keepalived traffic is forwarded through the HA device (As specified in the keepalived.conf file used by the keepalived instance in the router namespace). Here’s the output of ‘ip address’ in the router namespace:

[stack@vpn-6-88 ~]$ sudo ip netns exec qrouter-b30064f9-414e-4c98-ab42-646197c74020 ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default 
    ...
2794: ha-45249562-ec: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default 
    link/ether 12:34:56:78:2b:5d brd ff:ff:ff:ff:ff:ff
    inet 169.254.0.2/24 brd 169.254.0.255 scope global ha-45249562-ec
       valid_lft forever preferred_lft forever
    inet6 fe80::1034:56ff:fe78:2b5d/64 scope link 
       valid_lft forever preferred_lft forever
2795: qr-dc9d93c6-e2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default 
    link/ether ca:fe:de:ad:be:ef brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.1/24 scope global qr-dc9d93c6-e2
       valid_lft forever preferred_lft forever
    inet6 fe80::c8fe:deff:fead:beef/64 scope link 
       valid_lft forever preferred_lft forever
2796: qg-843de7e6-8f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default 
    link/ether ca:fe:de:ad:be:ef brd ff:ff:ff:ff:ff:ff
    inet 19.4.4.4/24 scope global qg-843de7e6-8f
       valid_lft forever preferred_lft forever
    inet6 fe80::c8fe:deff:fead:beef/64 scope link 
       valid_lft forever preferred_lft forever

That is the output for the master instance. The same router on another node would have no IP addresses on the ha, qr, or qg devices. It would have no floating IPs or routing entries. These are persisted as configuration values in keepalived.conf, and when keepalived detects the master instance failing, these addresses (Or: VIPs) are configured by keepalived on the appropriate devices. Here’s an example of keepalived.conf, for the same router shown above:

vrrp_sync_group VG_1 {
    group {
        VR_1
    }
    notify_backup "/path/to/notify_backup.sh"
    notify_master "/path/to/notify_master.sh"
    notify_fault "/path/to/notify_fault.sh"
}
vrrp_instance VR_1 {
    state BACKUP
    interface ha-45249562-ec
    virtual_router_id 1
    priority 50
    nopreempt
    advert_int 2
    track_interface {
        ha-45249562-ec
    }
    virtual_ipaddress {
        19.4.4.4/24 dev qg-843de7e6-8f
    }
    virtual_ipaddress_excluded {
        10.0.0.1/24 dev qr-dc9d93c6-e2
    }
    virtual_routes {
        0.0.0.0/0 via 19.4.4.1 dev qg-843de7e6-8f
    }
}

What are those notify scripts? These are scripts that keepalived executes upon transition to master, backup, or fault. Here’s the contents of the master script:

#!/usr/bin/env bash
neutron-ns-metadata-proxy --pid_file=/tmp/tmpp_6Lcx/tmpllLzNs/external/pids/b30064f9-414e-4c98-ab42-646197c74020/pid --metadata_proxy_socket=/tmp/tmpp_6Lcx/tmpllLzNs/metadata_proxy --router_id=b30064f9-414e-4c98-ab42-646197c74020 --state_path=/opt/openstack/neutron --metadata_port=9697 --debug --verbose
echo -n master > /tmp/tmpp_6Lcx/tmpllLzNs/ha_confs/b30064f9-414e-4c98-ab42-646197c74020/state

The master script simply opens up the metadata proxy, and writes the state to a state file, which can be later read by the L3 agent. The backup and fault scripts kill the proxy and write their respective states to the aforementioned state file. This means that the metadata proxy will be live only on the master router instance.

Aren’t We Forgetting the Metadata Agent?

Simply enable the agent on every network node and you’re good to go.

Future Work & Limitations

  • TCP connection tracking – With the current implementation, TCP sessions are broken on failover. The idea is to use conntrackd in order to replicate the session states across HA routers, so that when the failover finishes, TCP sessions will continue where they left off.
  • Where is the master instance hosted? As it is now, it is impossible for the admin to know which network node is hosting the master instance of an HA router. The plan is for the agents to report this information and for the server to expose it via the API.
  • Evacuating an agent – Ideally bringing down a node for maintenance should cause all of the HA router instances on said node to relinquish their master states, speeding up the failover process.
  • Notifying L2pop of VIP movements – Consider the IP/MAC of the router on a tenant network. Only the master instance will actually have the IP configured, but the same Neutron port and same MAC will show up on all participating network nodes. This might have adverse effects on the L2pop mechanism driver, as it expects a MAC address in a single location in the network. The plan to solve this deficiency is to send an RPC message from the agent whenever it detects a VRRP state change, so that when a router becomes the master, the controller is notified, which can then update the L2pop state.
  • FW, VPN and LB as a service integration. Both DVR and L3 HA have issues integrating with the advanced services, and a more serious look will be taken during the Kilo cycle.
  • One HA network per tenant. This implies a limit of 255 HA routers per tenant, as each router takes up a VRID, and the VRRP protocol allows 255 distinct VRID values in a single broadcast domain.

Usage & Configuration

neutron.conf:
l3_ha = True
max_l3_agents_per_router = 2
min_l3_agents_per_router = 2
  • l3_ha = True means that all router creations will default to HA (And not legacy) routers. This is turned off by default.
  • You can set the max to a number between min and the number of network nodes in your deployment. If you deploy 4 net nodes but set max to 2, only two l3 agents will be used per HA router (One master, one slave).
  • min is used as a sanity check: If you have two network nodes and one goes out momentarily, any new routers created during that time period will fail, as you need at least <min> L3 agents up when creating an HA router.

l3_ha controls the default, while the CLI allows an admin (And only admins) to override that setting on a per router basis:

neutron router-create --ha=<True | False> router1
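
The same flow through the API, along with a sanity check of where the router instances landed – a sketch using python-neutronclient (credentials are placeholders, and the ha attribute requires admin credentials):

from neutronclient.v2_0 import client

neutron = client.Client(username='admin', password='secret',
                        tenant_name='admin',
                        auth_url='http://controller:5000/v2.0')

# Create an HA router explicitly, overriding the l3_ha default
router = neutron.create_router(
    {'router': {'name': 'router1', 'ha': True}})['router']

# With max_l3_agents_per_router = 2, expect two hosting agents:
# one master instance and one standby
for agent in neutron.list_l3_agent_hosting_routers(router['id'])['agents']:
    print(agent['host'])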


Vote for OpenStack Paris sessions!

The window to vote for the Paris summit sessions has opened. I’ve submitted two talks:

http://www.openstack.org/vote-paris/Presentation/neutron-is-impossible

Neutron is Impossible!

“In this talk, we’ll demystify Neutron by explaining the what and the how of each major component. Additionally, we’ll describe what happens when you boot a VM in a visual way – How Nova interacts with Neutron, and the role of each Neutron component. We’ll talk about how Neutron accomplishes networking segmentation via VLANs and overlay networks. Just what are VLANs and tunneling, and how do they both solve the same problem? Finally, we’ll talk about what happens on the compute and network nodes by examining the different Open vSwitch and Linux bridges, patch ports and veth pairs, namespaces and tap devices and hopefully understand how it all comes together.

Following the talk, operators and developers should have a solid understanding of Neutron’s more puzzling concepts.”

 

http://www.openstack.org/vote-paris/Presentation/neutron-network-node-high-availability

Neutron Network Node High Availability

I’ve been working with Sylvain Afchain of Enovance on layer 3 high availability throughout the Juno cycle, and we’ve submitted a talk about it:

“Today, you can configure multiple network nodes to host DHCP and L3 agents. In the Icehouse release, load sharing is accomplished by scheduling virtual routers on L3 agents. However, if a network node or L3 agent goes down, all routers scheduled on that node will go down and connectivity will be lost. The Juno release will introduce L3 high availability as well as distributed routing.

L3 high availability schedules routers on more than a single network node. Stand-by routers use keepalived (And the VRRP protocol internally) to monitor the active router and pick up the slack in case of a failure. We’ll explain what VRRP is and how the feature will affect the network node, as well as give a demonstration.

Distributed routing, or DVR, moves routing to the compute nodes, while keeping SNAT on the network nodes.

We’ll talk about DHCP HA, go into a conceptual overview of both L3 HA and DVR, give demonstrations, and explain how to integrate them together to get the best of both worlds.”


The need for Network Overlays – part I

Originally posted on The Network Way - Nir Yechiel's blog:

The IT industry has gained significant efficiency and flexibility as a direct result of virtualization. Organizations are moving toward a virtual datacenter model, and flexibility, speed, scale and automation are central to their success. While compute, memory resources and operating systems were successfully virtualized in the last decade, primarily due to the x86 server architecture, networks and network services have not kept pace.

The traditional solution: VLANs

Way before the era of server virtualization, Virtual LANs (or 802.1q VLANs) were used to partition different logical networks (or broadcast domains) over the same physical fabric. Instead of wiring a separate physical infrastructure for each group, VLANs were used efficiently to isolate the traffic from different groups or applications based on the business needs, with a unique identifier allocated to each logical network. For years, a physical server represented one end-point from the network perspective and was attached to an “access” (i.e…



OVS ARP Responder – Theory and Practice

Prefix

In the GRE tunnels post I’ve explained how overlay networks are used for connectivity and tenant isolation. In the l2pop post, or layer 2 population, I explained how OVS forwarding tables are pre-populated when instances are brought up. Today I’ll talk about another form of table pre-population – The ARP table. This feature has been introduced with this patch by Edouard Thuleau, merged during the Juno development cycle.

ARP – Why do we need it?

In any environment, be it the physical data-center, your home, or a virtualization cloud, machines need to know the MAC, or physical network address, of the next hop. For example, let there be two machines connected directly via a switch:

The first machine has an IP address of 10.0.0.1, and a MAC address of 0000:DEAD:BEEF,

while the second machine has an IP address of 10.0.0.2, and a MAC address of 2222:FACE:B00C.

I merrily log into the first machine and hit ‘ping 10.0.0.2’. My computer places 10.0.0.2 in the destination IP field of the IP packet, then attempts to place a destination MAC address in the Ethernet header, and politely bonks itself on its digital forehead. Messages must be forwarded out of a computer’s NIC with the destination MAC address of the next hop (In this case 10.0.0.2, as they’re directly connected). This is so switches know where to forward the frame, for example.

Well, at this point, the first computer has never talked to the second one, so of course it doesn’t know its MAC address. How do you discover something that you don’t know? You ask! In this case, you shout. 10.0.0.1 will flood, or broadcast, an ARP request saying: What is the MAC address of 10.0.0.2? This message will be received by the entire broadcast domain. 10.0.0.2 will receive this message (Amongst others) and happily reply, in unicast: I am 10.0.0.2 and my MAC address is 2222:FACE:B00C. The first computer will receive the ARP reply and will then be able to fill in the destination MAC address field, and finally send the ping.

Will this entire process be repeated every time the two computers wish to talk to each other? No. Sane devices keep a local cache of ARP responses. In Linux you may view the current cache with the ‘arp’ command.

A slightly more complex case would be two computers separated by a layer 3 hop, or a router. In this case the two computers are in different subnets, for example 10.0.0.0/8 and 20.0.0.0/8. When the first computer pings the second one, the OS will notice that the destination is in a different subnet, and thus forward the message to the default gateway. In this case the ARP request will be sent for the MAC address of the pre-configured default gateway IP address. A device only cares about the MAC address of the next hop, not of the final destination.

The absurdity of L2pop without an ARP responder

Let there be VM 1 hosted on compute node A, and VM 2 hosted on compute node B.

With l2pop disabled, when VM 1 sends an initial message to VM 2, compute node A won’t know the MAC address of VM 2 and will be forced to flood the message out all tunnels, to all compute nodes. When the reply is received, node A would learn the MAC address of VM 2 along with the remote node and tunnel ID. This way, future floods are prevented. L2pop prevents even the initial flood by pre-populating the tables, as the Neutron service is aware of VM MAC addresses, scheduling, and tunnel IDs. More information may be found in the dedicated L2pop post.

So, we optimized one broadcast, but what about ARPs? Compute node A is aware of the MAC address (And whereabouts) of VM 2, but VM 1 isn’t. Thus, when sending an initial message from VM 1 to 2, an ARP request will be sent out. Compute node A knows the MAC address of VM 2 but chooses to put a blindfold over its eyes and send a broadcast anyway. Well, with the ARP responder feature this is no longer the case.

The OVS ARP responder – How does it work?

A new table is inserted into the br-tun OVS bridge, to be used as an ARP table. Whenever a message is received by br-tun from a local VM, it is classified into unicast, broadcast/multicast and now ARP requests. ARP requests go into the ARP table, where pre-learned MAC addresses (Via l2pop, more in a minute) reside. Rows in this table are then matched against the (ARP protocol, network, IP of the requested VM) tuple. The resulting action is to construct an ARP reply that will contain the IP and MAC addresses of the remote VM, and will be sent back from the port it came in on to the VM making the original request. If a match is not found (For example, if the VM is trying to access a physical device not managed by Neutron, thus was never learned via L2pop), the ARP table contains a final default flow, to resubmit the message to the broadcast/multicast table, and the message will be treated like any old broadcast.

The table is filled whenever new L2pop address changes come in. For example, when VM 3 is hosted on compute C, both compute nodes A and B get a message that a VM 3 with IP address ‘x’ and MAC address ‘y’ is now on host C, in network ‘z’. Thus, compute nodes A and B can now fill their respective ARP tables with VM 3’s IP and MAC addresses.

The interesting code is currently at:

https://github.com/openstack/neutron/blob/master/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py#L484

For help on reading OVS tables, and an explanation of OVS flows and how they’re comprised of match and action parts, please see a previous post.

Blow by blow:

Here’s the action part:

actions = ('move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],'   # Place the source MAC address of the request (the requesting VM) as the reply's destination MAC address
           'mod_dl_src:%(mac)s,'                        # Put the requested MAC address of the remote VM as this message's source MAC address
           'load:0x2->NXM_OF_ARP_OP[],'                 # Put an 0x2 code as the type of the ARP message; 0x2 is an ARP reply
           'move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],'   # Place the ARP request's source hardware address (MAC) as this new message's ARP target / destination hardware address
           'move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],'   # Place the ARP request's source protocol / IP address as the new message's ARP destination IP address
           'load:%(mac)#x->NXM_NX_ARP_SHA[],'           # Place the requested VM's MAC address as the source MAC address of the ARP reply
           'load:%(ip)#x->NXM_OF_ARP_SPA[],'            # Place the requested VM's IP address as the source IP address of the ARP reply
           'in_port' % {'mac': mac, 'ip': ip})          # Forward the message back to the port it came in on

Here’s the match part:

self.tun_br.add_flow(table=constants.ARP_RESPONDER,  # Add this new flow to the ARP_RESPONDER table
                     priority=1,                     # With a priority of 1 (another, default flow with the lower priority of 0 is added elsewhere in the code)
                     proto='arp',                    # Match only on ARP messages
                     dl_vlan=lvid,                   # Match only if the local VLAN tag (the message has been locally VLAN tagged by now) matches the VLAN ID / network of the remote VM
                     nw_dst='%s' % ip,               # Match on the IP address of the remote VM in question
                     actions=actions)

Example:

An ARP request comes in.

In the Ethernet frame, the source MAC address is A, the destination MAC address is FFFF:FFFF:FFFF.

In the ARP header, the source IP address is 10.0.0.1, the destination IP is 10.0.0.2, the source MAC is A, and the destination MAC is FFFF:FFFF:FFFF.

Please make sure that entire part makes sense before moving on.

Assuming L2pop has already learned about VM B, the hypervisor’s ARP table will already contain an ARP entry for VM B, with IP 10.0.0.2 and MAC B.

Will this message be matched? Sure, the proto is ‘arp’, they’re in the same network so dl_vlan will be correct, and nw_dst (This part is slightly confusing) will correctly match on the destination IP address of the ARP header, seeing as ARP replaces IP in the third layer during ARP messages.

What will be the action? Well, we’d expect an ARP reply. Remember that ARP replies reverse the source and destination so that the source MAC and IP inside the ARP header are the MAC and IP addresses of the machine we asked about originally, and the destination MAC address in the ARP header is the MAC address of the machine originating the ARP request. Similarly we’d expect that the source MAC of the Ethernet frame would be the MAC of the VM we’re querying about, and the destination MAC of the Ethernet frame would be the MAC of the VM originating the ARP request. If you carefully observe the explanation of the action part above, you would see that this is indeed the case.

Thus, the source MAC of the Ethernet frame would be B, the destination MAC A. In the ARP header, the source IP 10.0.0.2 and source MAC B, while the destination IP 10.0.0.1 and destination MAC A. This ARP reply will be forwarded back through the port which it came in on and will be received by VM A. VM A will unpack the ARP reply and find the MAC address which it queried about in the source MAC address of the ARP header.

Turning it on

Assuming ML2 + OVS >= 2.1:

  • Turn on GRE or VXLAN tenant networks as you normally would
  • Enable l2pop
    • On the Neutron API node, in the conf file you pass to the Neutron service (plugin.ini / ml2_conf.ini):
[ml2]
mechanism_drivers = openvswitch,l2population
    • On each compute node, in the conf file you pass to the OVS agent (plugin.ini / ml2_conf.ini):
[agent]
l2_population = True
  • Enable the ARP responder: On each compute node, in the conf file you pass to the OVS agent (plugin.ini / ml2_conf.ini):
[agent]
arp_responder = True

To summarize, you must use VXLAN or GRE tenant networks, you must enable l2pop, and finally you need to enable the arp_responder flag in the [agent] section in the conf file you pass to the OVS agent on each compute node.

Thanks

Props to Edouard Thuleau for taking the initiative and doing the hard work, and to the rest of the Neutron team for the lengthy review process! It took us nearly 8 months, but we finally got it merged, in fantastic shape.


Introduction to Neutron

I recently gave an internal Red Hat talk entitled: Introduction to Neutron. It is a high-level, concepts oriented talk.

In it I talk about:

  • Why Neutron?
  • An example of network virtualization
  • Ports, networks and subnets
  • External, provider and tenant networks
  • L3 model – Internal and external subnets, routers, NAT and floating IPs
  • An overview of the different Neutron components
  • Nova <-> Neutron interaction when creating a VM
  • Explanation of the core plugin concept
  • Brief rundown of the service plugins (VPN, Load balancing and Firewalls)

Here’s the PDF.

And the video:


What Does Open Source Mean to Me?

I gained some development experience in various freelance projects and figured I’d apply for a development position during my last semester of Computer Science studies. I sought a student role in a large corporation so that I wouldn’t be relied upon too heavily, as I wanted to prioritize my studies (Please see ‘You have your entire life left to work’ and similar cliches). I applied to a bunch of places, including Red Hat – My would-be boss gave a talk in my school about open source culture and internship positions, otherwise I would never have heard about a Linux company in a Microsoft dominated nation. Microsoft has solid contracts with the Israeli Defense Force, and with Israeli high tech being led mostly by ex-IDF officers, CTOs tend to go with Microsoft technology. In any case, Red Hat had an internship position in the networking team of a virtualization product (I had networking experience from my army service), paid generously, their offices were close by, it all lined up.

At this point, open source meant nothing to me.

At Red Hat, I started working on a project called oVirt. While it has an impressive user base, and its Q&A mailing list gets a healthy amount of traffic, it does not have a significant development community outside of Red Hat. Here I started experiencing the efforts that go into building an expansive open source community. Open source is not free, contrary to popular belief – It is, in fact, quite costly for a project in oVirt’s stage. For example, when working in a closed source company and designing a new feature, normally you would write a specification down, discuss it with your team members, and get going. In oVirt, you’d share the specification first with the rest of the community. The resulting discussion can take weeks, and with a time based release schedule that inherent delay must be factored in during planning. All communication must be performed on public (and archived) media such as mailing lists and IRC channels. Hallway discussions are natural but frowned upon when it comes to feature design and other aspects of the development process that should be shared with the community. Then comes the review process. I’m a big believer in peer reviews, regardless of whether the project is open or closed, but surely in an open source project the review process is much better felt. One of the key elements to building a community is taking the time to review code submitted by non-Red Hatters. You could never hope to get an active development community going if code sits in the repository for weeks, attracting no attention. To this end, code review becomes part of your job description. Some people do it quite well; some people, like me, have a lot of room to improve. I find reviewing code infinitely harder than writing it. In fact, I find it so hard that I must force myself to do it, doubly so when the code is written by a faceless community member that cannot knock a basketball over my head if I don’t review his code (Dear mankind: Please don’t ever invent that technology).

At this point, open source was a burden for me.

Six months back I was moved to another project called OpenStack. Still in the same team, under the same boss, just working on another project. OpenStack, while comparable to oVirt technologically, is very different from oVirt, in the sense that it has a huge development community. By huge, I mean thousands strong. OpenStack is composed of sub projects – The networking project alone has hundreds of developers working on it regularly. At the time I was moved I was the only Israeli developer working on it. The rest of the Red Hat OpenStack team was located in the Czech Republic and in the US. As you can imagine, a lot of self learning was to be had. Conveniently, the (community maintained) OpenStack documentation is excellent. My teammates were no longer working for the same company I was, nor were they down the hall. I did most of my work with individuals spread all over the world. I met some in FOSDEM this past February (Probably the highlight of the event for me), at which point I began to understand the importance of building personal relationships, and I will expand on this below.

The beauty of open source and the basis of a meritocracy is that the strongest idea wins. You might stumble upon an infuriating bug which might seem like the most important issue facing the project (And, in fact, humanity). You start working on it, submit a patch, and quickly discover that nobody gives a shit about your bug. Instead of being frustrated by the difficulty of moving forward, I learned two lessons:

  1. Building personal relationships is the only way to drive change
  2. ‘The community’ can realign your understanding of what is important

Maybe there is good reason nobody cares about that bug. Maybe it was a waste of time working on it, not because the patch was not accepted (In time, or at all); maybe it was a waste of time because it was just a waste of time. Maybe that bug was just not important, and you should have invested your time working on anything else. There are more issues than available resources, and your choice of what to tackle is more important than the urgency of what’s in front of you.

In addition to navigating between the perceived urgency of issues, the community can help you reflect and choose the better solution. I always love hearing people’s ideas, and this concept is expressed beautifully in the review system. Getting criticism from strangers and collaborators alike always constitutes a learning experience. Luckily OpenStack is being developed by very smart individuals who can help you understand if your solution is terrible, or simply realign your trajectory. I find that it’s sometimes even helpful to get feedback from people with opposing interests – Perhaps together you can form a solution that answers all use cases in a generalized manner. Such a solution might just end up being of higher quality than one that would have dealt only with your own customer’s needs.

At this point, open source is obvious to me.


ML2 – Address Population

Why do we need it, whatever it is?

VM unicast, multicast and broadcast traffic flow is detailed in my previous post:

Tunnels in Openstack Neutron

TL;DR: Agent OVS flow tables implement learning. That is, any unknown unicast destination (IE: MAC addresses the virtual switch is not familiar with), multicast or broadcast traffic is flooded out tunnels to all other compute nodes. Any incoming traffic is examined for its source MAC address, which is added to a learning table, so that future traffic to that MAC address is not flooded but sent directly to the hosting node. There are several inefficiencies here:

  1. The MAC addresses aren’t initially known by the agents, but the Neutron service has full knowledge of the topology
  2. There’s still a lot of broadcasts going around in the form of ARP requests. Maybe we can optimize those away?
  3. More about broadcasts: What if a node isn’t hosting any ports in a specific network? Should this node receive broadcast traffic designated to that network?

A great visual explanation for the third point may be found in the official OpenStack documentation, linked at the end of this post.

Overview

When using the ML2 plugin with tunnels, and a new port goes up, ML2 sends an update_port_postcommit notification which is picked up and processed by the l2pop mechanism driver. l2pop then gathers the IP and MAC of the port, as well as the host that the port was scheduled on; it then sends an RPC notification to all layer 2 agents. The agents use the notification to solve the three issues detailed above.
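
For a feel of what goes over the wire: the fanout payload is essentially a dictionary of forwarding entries keyed by network ID. The sketch below approximates the Icehouse-era format (field names from memory, so treat it as a rough guide rather than the authoritative schema):

# Roughly what an 'add_fdb_entries' payload looks like: one network,
# the tunnel endpoint of the hosting node, the special flooding entry,
# and the new port's MAC and IP
fdb_entries = {
    'net-id-1': {
        'network_type': 'vxlan',
        'segment_id': 1001,  # tunnel ID / VNI
        'ports': {
            '192.168.0.10': [  # tunnel IP of the hosting node
                ['00:00:00:00:00:00', '0.0.0.0'],   # flooding entry
                ['fa:16:3e:ab:cd:ef', '10.0.0.5'],  # port MAC, port IP
            ],
        },
    },
}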

Configuration

ml2_conf.ini:
[ml2]
mechanism_drivers = ..., l2population, ...
[agent]
l2_population = True

Deep-Dive & Code

plugins/ml2/drivers/l2pop/mech_driver.py:update_port_postcommit calls _update_port_up. In _update_port_up we send the new port’s IP and MAC address to all agents via an ‘add_fdb_entries’ RPC fanout cast. Additionally, if this new port is the first port in a network on the scheduled agent, then we send all IP and MAC addresses on the network to that agent.

‘add_fdb_entries’ is picked up via agent/l2population_rpc.py:add_fdb_entries, which calls fdb_add if the RPC call was a fanout, or directed to the local host.

fdb_add is implemented by the OVS and LB agents: plugins/openvswitch/agent/ovs_neutron_agent.py and plugins/linuxbridge/agent/linuxbridge_neutron_agent.py.

In the OVS agent, fdb_add accomplishes three main things:

For each port received:

  1. Set up a tunnel to the remote agent if one does not already exist
  2. If it’s a flood entry, set up a flood flow to the remote network. Reminder: A flood flow is sent out to all agents in case a port goes up which happens to be the first port for an agent & network pair
  3. If it’s a unicast entry, add it to the unicast learning table
  4. A big fat TO-DO about ARP replies. Implemented in the Icehouse release with this patch: https://review.openstack.org/#/c/49227/

Finally, with l2_population = True, a bunch of code in the OVS agent is disabled: tunnel_update and tunnel_sync RPC messages are ignored, and replaced by fdb_add and fdb_remove.

Supported Topologies

All of this is fully supported since the Havana release when using GRE and VXLAN tunneling with the ML2 plugin, apart from the ARP resolution optimization which is implemented only for the Linux bridge agent with the VXLAN driver. ARP resolution will be added to the OVS agent with GRE and VXLAN drivers in the Icehouse release.

Links

http://docs.openstack.org/admin-guide-cloud/content/ch_networking.html#ml2_l2pop_scenarios
