OpenStack

Tenant, Provider and External Neutron Networks

To this day I see confusion surrounding the terms: Tenant, provider and external networks. No doubt countless words have been spent trying to tease apart these concepts, so I thought that it’d be a good use of my time to write 470 more.

At a Glance

Network type	Creator	Model	Segmentation	External router interfaces
Tenant	User	Self service	Selected by Neutron	No
Provider	Administrator	Pre created & shared	Selected by the creator	No
External	Administrator	Pre created & shared	Selected by the creator	Yes

A Closer Look

Tenant networks are created by users, and Neutron is configured to automatically select a network segmentation type like VXLAN or VLAN. The user cannot select the segmentation type.

Provider networks are created by administrators, who can set one or more of the following attributes:

Segmentation type (flat, VLAN, Geneve, VXLAN, GRE)
Segmentation ID (VLAN ID, tunnel ID)
Physical network tag

Any attributes not specified will be filled in by Neutron.
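To make the "unspecified attributes are filled in by Neutron" point concrete, here is a minimal sketch of the JSON body an administrator's client might send when creating a provider network. The `provider:*` keys are the attribute names of Neutron's provider extension; the helper name `provider_network_body` and the example values (`dc-vlan-100`, `physnet1`, VLAN 100) are made up for illustration.

```python
def provider_network_body(name, network_type=None, segmentation_id=None,
                          physical_network=None):
    """Build a 'create network' request body (hypothetical helper).

    Only the attributes the administrator chose are included; anything
    left as None is omitted so that Neutron selects it.
    """
    network = {"name": name}
    optional = {
        "provider:network_type": network_type,          # flat, vlan, geneve, vxlan, gre
        "provider:segmentation_id": segmentation_id,    # VLAN ID or tunnel ID
        "provider:physical_network": physical_network,  # physical network tag
    }
    network.update({k: v for k, v in optional.items() if v is not None})
    return {"network": network}


# Fully specified VLAN network (example values are assumptions).
body = provider_network_body("dc-vlan-100", network_type="vlan",
                             segmentation_id=100, physical_network="physnet1")

# Only the type is chosen; the segmentation ID is left to Neutron.
partial = provider_network_body("overlay", network_type="vxlan")
```

A regular user sending the same body would be rejected by the default policy, which is the practical difference between the Tenant and Provider rows of the table.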

OpenStack Neutron supports self service networking – the notion that a user in a project can articulate their own networking topology, completely isolated from other projects in the same cloud, via the support of overlapping IPs and other technologies. A user can create their own network and subnets without the need to open a support ticket or the involvement of an administrator. The user creates a Neutron router, connects it to the internal and external networks (defined below) and off they go. Using the built-in ML2/OVS solution, this implies using the L3 agent, tunnel networks, floating IPs and liberal use of NAT techniques.

Provider networks (read: pre-created networks) are the basis of an entirely different networking architecture for your cloud. You’d forgo the L3 agent, tunneling, floating IPs and NAT. Instead, the administrator creates one or more provider networks, typically using VLANs, shares them with users of the cloud, and disables the ability of users to create networks, routers and floating IPs. When a new user signs up for the cloud, the pre-created networks are already there for them to use. In this model, the provider networks are typically routable: they are advertised to the public internet by physical routers using BGP. Therefore, provider networks are often said to be mapped to pre-existing data center networks, both in terms of VLAN IDs and subnet properties.

External networks are a subset of provider networks with an extra flag enabled (aptly named ‘external’). The ‘external’ attribute of a network signals that virtual routers can connect their external facing interface to the network. When you use the UI to give your router external connectivity, only external networks will show up on the list.
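The filtering that the UI performs can be sketched in a few lines. This is only an illustration of the idea, not Horizon's actual code: the network records below are invented, and the only thing borrowed from Neutron is the `router:external` attribute name.

```python
def external_networks(networks):
    """Return the networks a router may use for its external gateway."""
    return [net for net in networks if net.get("router:external", False)]


# Hypothetical network listing, shaped like Neutron API responses.
networks = [
    {"name": "private", "router:external": False},
    {"name": "dc-vlan-100", "router:external": False},
    {"name": "public", "router:external": True},
]

gateway_choices = [net["name"] for net in external_networks(networks)]
```

With this listing, only `public` would be offered when setting a router's gateway, even though `dc-vlan-100` is also a provider network.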

To summarize, I think that the confusion is due to a naming issue. Had the network types been called: self-service networks, data center networks and external networks, this blog post would not have been necessary and the world would have been even more exquisite.

OpenStack

When is it not cool to add a new OpenStack configuration option?

Adding new configuration options has a cost, and makes already complex projects (Hi Neutron!) even more so. Doubly so when we speak of architecture choices, since it means that we have to test and document all permutations. Of course, we don’t always do that, nor do we test all interactions between deployment options and other advanced features, leaving users with fun surprises. With some projects seeing an increased rotation of contributors, we’re seeing wastelands of unmaintained code being left behind, increasing the importance of being strategic about introducing new complexity.

I sort the introduction of new OpenStack configuration options into two categories:

There are two or more classes of operators that have legitimate use cases, and a configuration option would enable those use cases without hurting cloud interoperability.
- For example: Neutron DVR is essentially a driver for router implementations that changes your L3 architecture substantially while abstracting details via an API. DVR has various costs and benefits, and letting operators make that choice, especially as the feature matures, makes sense to me.
Developers don’t fully understand the choice they are making and pass the complexity down to the operators to figure out. This results in options that are often never changed from their defaults, because operators don’t have access to sufficient documentation that explains the rationale for choosing a specific value. It is also a misuse of time and energy, because developers sometimes focus on use cases that are off center or don’t exist.
- For example: neutron.conf:DEFAULT:send_events_interval: Number of seconds between sending events to nova if there are any events to send. Unless you grep through the code, how is an operator supposed to know if they should increase or decrease the value? Even then, shouldn’t developers take responsibility for that choice, and test for the best value? If problems are found under load, shouldn’t the value be calculated as a function of some variable? Instead of distributing the work to thousands of operators, wouldn’t we as a community like to do the work in one place?
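As a purely hypothetical illustration of "calculated as a function of some variable", the sketch below derives a send interval from the number of pending events instead of reading it from configuration. None of this is Neutron code: the function name, the choice of backlog as the variable, and all the constants are assumptions made for the example.

```python
def send_interval(pending_events, min_interval=0.5, max_interval=5.0,
                  target_batch=50):
    """Pick a delay (seconds) before the next batch of events is sent.

    With a small backlog we wait the full max_interval so events can be
    batched; as the backlog grows past target_batch the delay shrinks
    proportionally, but never below min_interval.
    """
    backlog = max(pending_events, target_batch)
    interval = max_interval * target_batch / backlog
    return max(interval, min_interval)


quiet = send_interval(0)       # idle system: wait the full max_interval
busy = send_interval(500)      # ten times the target batch: shortest delay
flooded = send_interval(5000)  # clamped at min_interval
```

The point is not this particular formula but where the decision lives: the developer tests and owns it once, rather than every operator guessing at a number.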

When contemplating adding a new option, ask yourself:

Is it possible that you don’t fully understand the use case, and in lieu of making a choice, you’re letting the operator bear the burden?
Are you, and your replacement, prepared to own the cost of the new option indefinitely?

OpenStack, Talks

Upstream Contribution – Give Up or Double Down? The Boston Edition

As a follow-up to a previous blog post (https://assafmuller.com/2016/12/02/upstream-contribution-give-up-or-double-down/), I presented a version with new content at the recent OpenStack Boston summit. The session had a fair amount of participation and discussion, where we talked about the journey of a new contributor in OpenStack through the lens of data gathered from Gerrit and Stackalytics. We even had members of the Technical Committee who were interested in concrete action items they could take on their part: How do we make it easier for new contributors? What behaviors work in the quest to get your patches merged quicker?

The video, slides and code used to generate the data are below:

https://github.com/assafmuller/gerrit_time_to_merge

OpenStack

Is OpenStack Neutron ML2/OVS Production Ready for Large Scale Deployments?

One of my personal highlights of the recent Barcelona Summit was a session by Mirantis engineers Elena and Oleg titled “Is OpenStack Neutron Production Ready for Large Scale Deployments?”. In the session they outline a comprehensive control and data plane testing effort, run on two labs, one with 200 nodes and run of the mill hardware, and the other with 378 nodes and top of the line hardware, all running the Mirantis distribution based off Mitaka with standard ML2/OVS, DVR, L2POP and VXLAN. In the session they show near line-rate speed for east/west and north/south routing with jumbo frames and VXLAN offload enabled. They were also able to spawn 24,500 VMs across 125 networks without errors and with low CPU consumption.

Slides on SlideShare

https://twitter.com/assafmuller/status/795605316100587521

Turning our eyes to adoption, the OpenStack Foundation conducts a usage survey every 6 months. Looking at the April 2016 user survey, we can see that ML2 with Open vSwitch and Linux Bridge dwarf other solutions.

https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf, slide 41, taking out “ML2” and “nova-network”.

Examining the openstack/neutron project via Stackalytics, we see that ML2/OVS has a rich and robust community, with 20 companies contributing over 5 patches in the Newton time frame. 779 people have contributed at least 1 patch to Neutron since its inception, 215 of them during the Newton time frame. Some of the effort targeted the base Neutron platform, e.g. configuration options, database work, versioned objects, quotas, or other in-tree ML2 drivers such as SRIOV. Looking at the contributions of everyone who has committed at least 5 patches in the Newton cycle, we are left with 50 authors, 42 of whom contributed at least 1 patch to ML2/OVS.

Taken from http://stackalytics.com/?release=newton&module=neutron&metric=commits after filtering out authors with fewer than 5 patches

Looking at the interactive version of the OpenStack user survey we can see that ML2/OVS is the most popular choice by an order of magnitude regardless of deployment size. And so to answer the question: “Is OpenStack Neutron ML2/OVS Production Ready for Large Scale Deployments?”. Yes, it is, of course it is. It has been for some time now.

OpenStack, Talks

Upstream Contribution – Give Up or Double Down?

Ever since I’ve been involved with OpenStack people have been complaining that upstream is hard. The number one complaint is that it takes forever to get your patches merged. I thought I’d take a look at some data and attempt to visualize it. I wrote some code that accepts an OpenStack project and a list of contributors and spits out a bunch of graphs. For example:

How long does it take to merge patches in a given project over time? Looking back, did governance changes affect anything?
Is there a correlation between the size of a patch and the length of time it takes to merge it? (Spoiler: The answer is… Kind of)
Looking at an author: Does the time to merge patches trend down over time?
Looking at the average length of time it takes to merge patches per author, what does it look like when we graph it as a function of the author’s number of patches? Reviews? Emails? Bug reports? Blueprints implemented?

The data suggests answers for many of those questions and more.

Here’s the code for you to play with:

https://github.com/assafmuller/gerrit_time_to_merge
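Independently of that repository, the two basic measurements behind the questions above can be sketched as follows. This is not the code from the repository, just a minimal illustration. It assumes change records shaped like Gerrit's, with `created` and `submitted` timestamps carrying nanosecond precision and `insertions`/`deletions` counts; the sample records are invented.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a Gerrit-style timestamp such as '2016-08-01 09:59:32.126000000'.

    strptime's %f accepts at most six fractional digits, so the
    nanosecond tail is truncated to microseconds first.
    """
    return datetime.strptime(ts[:26], "%Y-%m-%d %H:%M:%S.%f")


def days_to_merge(change):
    """Days between a change being uploaded and being merged."""
    delta = parse_ts(change["submitted"]) - parse_ts(change["created"])
    return delta.total_seconds() / 86400


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


# Invented sample records, for illustration only.
changes = [
    {"created": "2016-08-01 00:00:00.000000000",
     "submitted": "2016-08-03 12:00:00.000000000",
     "insertions": 10, "deletions": 2},
    {"created": "2016-08-05 00:00:00.000000000",
     "submitted": "2016-08-15 00:00:00.000000000",
     "insertions": 300, "deletions": 120},
    {"created": "2016-08-10 00:00:00.000000000",
     "submitted": "2016-08-14 00:00:00.000000000",
     "insertions": 40, "deletions": 5},
]

sizes = [c["insertions"] + c["deletions"] for c in changes]
durations = [days_to_merge(c) for c in changes]
size_vs_time = pearson(sizes, durations)
```

Grouping `durations` by author or by month gives the per-author and over-time views mentioned in the list.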

And some conclusions in the slides embedded below:

https://docs.google.com/presentation/d/17l720kXrAHJC9_gU81nuIGzJCosralsbtbHGUuXgaxk/edit?usp=sharing

Here are a few resources about effective upstream contribution. It’s all content written by and for the Neutron community but it’s applicable to any OpenStack project.

https://www.youtube.com/watch?v=ifBXiaFGry4 – “Land Your First Neutron Patch”, an OpenStack Paris summit talk by Rossella Sblendido
http://www.slideshare.net/kevintbenton/how-to-get-reviewers-to-block-your-changes – “How to get reviewers to block your changes”, a hilarious OpenStack summit lightning talk with practical tips by Kevin Benton
http://docs.openstack.org/developer/neutron/devref/effective_neutron.html – “Effective Neutron: 100 specific ways to improve your Neutron contributions”, a collection of best practices both technical and procedural written with blood by Neutron community members over the years

OpenStack

Actionable CI

I’ve observed a persistent theme across valuable and successful CI systems, and that is actionable results.

A CI system for a project as complicated as OpenStack requires a staggering amount of energy to maintain and improve. Often times the responsible parties are focused on keeping it green and are buried under a mountain of continuous failures, legit or otherwise. So much so that they don’t have time to focus on the following questions:

How do you determine that a job failed?
How are the results presented to the relevant developers?
Can developers do anything about a failure?

To bring it to concrete terms, let’s take a look at how Rally is used in upstream jobs. This is not a criticism of the Rally project itself, which I’m a big fan of, but rather of how it’s used upstream. It uses the standard upstream CI infrastructure, which is a miracle of engineering when it comes to correctness tests. The infrastructure spins up VMs from a node pool comprised of many clouds. It then uses devstack-gate and devstack to install OpenStack and runs several Rally scenarios. When the result of a CI run is True or False, the variance of hardware and congestion levels is irrelevant. However, when you’re trying to measure performance, variance matters. You can try setting a maximum and declaring any result over it a failure, but with a sufficiently large variance, setting up SLAs is an exercise in futility.

Let’s look at a recent Linux Bridge change [1] that cannot impact Rally results (the Rally job is set up to run against Open vSwitch). Consecutive runs would ideally show the same results. However, looking at the results of patchsets 10, 11, 13 and 14, we can see that the total length of the job ranges between 60 and 83 minutes. The full duration of the create_and_list_ports flow ranges from 1517 seconds to 1887 seconds. The average for a single create_and_list_ports execution ranges between 4.58s and 5.29s. What am I supposed to do with the results of the next run? What can I learn from it? I’d argue: Nothing. The results of the job are not actionable. The result is that the job has been non-voting ever since its introduction and, worse yet, none of the engineers I work with look at its results.
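To put the numbers above in perspective, here is the arithmetic spelled out as a small sketch. The ranges are the ones quoted in the previous paragraph; the helper name `relative_spread` is just for this example.

```python
def relative_spread(low, high):
    """How much larger the slowest run is than the fastest, as a fraction."""
    return (high - low) / low


# (fastest, slowest) values observed across patchsets 10, 11, 13 and 14.
observed = {
    "total job time (minutes)": (60, 83),
    "create_and_list_ports, full duration (s)": (1517, 1887),
    "create_and_list_ports, single execution average (s)": (4.58, 5.29),
}

for name, (low, high) in observed.items():
    print(f"{name}: {relative_spread(low, high):.0%} spread")
```

That works out to roughly 38%, 24% and 16% respectively, for a change that cannot affect the measured code path at all. Any real regression smaller than that is indistinguishable from noise, and a threshold loose enough to tolerate the noise would let such a regression through.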

The next step would be to give up on the idea of gating or blocking performance regressions and instead detect them after the fact. We can do that by persisting historical results, graphing them and spotting trends. It’s clear that with a variance this large, the results would not be actionable either. To demonstrate this, let’s turn to the fantastic openstack-health project. Looking at the Neutron API test with the longest average run time [2] we can see that at the time of writing, the test ran 249 times in the past month so we get a great sample size. However, the run time graph looks like a Jackson Pollock painting, with a min of just under 5s and a max of just over 9s. Looking at the graph it’s clear we can’t clean up the data via statistical Jiu Jitsu either. When consistency matters, I don’t think you can get around a dedicated bare metal setup.

The Gerrit interface does a great job of presenting CI results, and a failing voting job forces developers to look at its results. However, I don’t know many engineers who look at CI results as a form of amusement. Post-merge and periodic CI run into these issues: they burn your favorite form of fossil fuels and drain the life force of the fine folks who maintain them, but the results are often not presented in a consumable manner. Running the tests reliably is as important as making sure the intended audience is aware of the results. One solution could be to make sure the relevant developers subscribe to a mailing list, and to trigger a mail on failures after filtering out distracting infrastructure issues. Periodic CI can only be valuable if it’s actionable and developers are held accountable and demonstrate a persistent urgency toward failures.

[1] https://review.openstack.org/#/c/346377/
[2] http://status.openstack.org/openstack-health/#/test/neutron.tests.tempest.api.test_auto_allocated_topology.TestAutoAllocatedTopology.test_get_allocated_net_topology_as_tenant?resolutionKey=day&duration=P1M

OpenStack

New Neutron testing guidelines!

Yesterday we merged https://review.openstack.org/#/c/245984/ which adds content to the Neutron testing guidelines:

http://docs.openstack.org/developer/neutron/devref/development.environment.html#testing-neutron

The document details Neutron’s different testing infrastructures:

Unit
Functional
Fullstack (Integration testing with services deployed by the testing infra itself)
In-tree Tempest

The new documentation provides:

Advantages and use cases for each testing framework
Examples
Do’s and don’ts
Good and bad usage of mock
The anatomy of a good unit test

It’s short – I encourage developers to go through it. Reviewers may save time by linking to it when testing anti-patterns pop up.

Enjoy, I hope you’ll find it useful.

OpenStack, Talks

Neutron HA Talk in Tokyo Summit

Florian Haas, Adam Spiers and I presented a session in Tokyo about Neutron high availability. We talked about:

API service HA using haproxy and Pacemaker
Active/active DHCP HA
Router high availability
What the SUSE and Red Hat solutions look like

Slides: http://fghaas.github.io/openstacksummit2015-tokyo-neutron-ha/
Source: https://github.com/fghaas/openstacksummit2015-tokyo-neutron-ha

OpenStack

Neutron Troubleshooting

I recently gave an internal talk about Neutron troubleshooting and wanted to share it with the world. Here’s the slide deck and video, enjoy!

OpenStack

Neutron in-tree integration tests

It’s time for OpenStack projects to take ownership of their quality. Introducing in-tree, whitebox multinode simulated integration testing. A lot of work went in over the last few months by a lot of people to make it happen.

http://docs.openstack.org/developer/neutron/devref/fullstack_testing.html

We plan on adding integration tests for many of the more evolved Neutron features over the coming months.

Assaf Muller
