Preface

One of the most frequent questions I get about the (now legacy) HAProxy-in-namespace LBaaS driver concerns its high availability options.

Up until OpenStack Ocata, the direct implication of a faulty OpenStack node hosting a Neutron LBaaSv2 agent was that all loadbalancers scheduled to that agent were out of service.

Needless to say, such a scenario is not sustainable for production environments. It essentially creates a single point of failure for each loadbalancer, even if the OpenStack deployment consists of multiple LBaaSv2 agents (one per node).

Starting with Ocata, the LBaaS community has a solution for the above-mentioned scenario.

But before we dive in, I’d like to point out yet another point of failure in the LBaaS Agent + HAProxy-in-namespace implementation: the HAProxy process itself. To make that implementation truly highly available, we also need to account for the scenario in which the HAProxy process dies unexpectedly. For that, I implemented a ProcessMonitor for the haproxy driver, which I’ll describe in part II, so stay tuned 🙂

Overview

The neutron-server uses a periodic check mechanism (AKA the agent status check worker) to monitor the status of some of its agents. By leveraging the same mechanism and making the resource auto-rescheduling logic more generic, we can now monitor LBaaS agents and trigger a loadbalancer reschedule event when one of them goes out of service. This essentially guarantees loadbalancer service longevity, since the loadbalancer will always reside on a functioning agent.

Note that when a loadbalancer gets rescheduled to a new agent, it is there to stay as long as that agent is alive (no preemption). This means that if, for example, a loadbalancer got rescheduled from agent A to agent B, it will not shift back to agent A when agent A recovers. This is by design, in order to avoid needless service interruptions. Moreover, when agent A recovers it will sync and realize it no longer owns the aforementioned loadbalancer. As a result, it will kill any orphan HAProxy processes running on its node. This is done to avoid an IP conflict with the VIP address, which now resides under agent B, on a different node.

Configuration

  • In /etc/neutron/neutron_lbaas.conf, set the following (see the snippet right after this list):

  • Update systemd (or any other equivalent you use) configuration to provide the above-mentioned config file as a parameter for neutron-server
    • neutron-server example from devstack

    • systemd configuration file ExecStart example (using --config-dir), taken from here; an illustrative ExecStart line also appears after this list

  • Restart neutron-server for changes to take effect
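
The option in question is allow_automatic_lbaas_agent_failover (more on it in the deep dive below). A minimal sketch of the first bullet, assuming the option is registered under the DEFAULT group:

    [DEFAULT]
    allow_automatic_lbaas_agent_failover = True

As for the systemd part, the exact binary path and unit layout depend on your deployment, so treat the following ExecStart lines purely as an illustration of the two oslo.config flags you can use to make neutron-server pick up neutron_lbaas.conf:

    # Either pass the file explicitly...
    ExecStart=/usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/neutron_lbaas.conf

    # ...or point neutron-server at a directory that contains it
    ExecStart=/usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf --config-dir /etc/neutron/conf.d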

Congratulations! From now on your loadbalancers will live long and prosper.

Demo

Setup

For this demo, I used two devstack nodes running with the following configuration:

  • Main node configuration: here
  • Secondary node configuration: here

The end result should be two devstack nodes, each running its own LBaaSv2 agent, both reported as alive by neutron.

Lastly, you’ll need to alter your neutron configuration files as mentioned in the configuration section of this blog post.

Now for the actual demo; a rough sketch of the commands behind each step is collected after the list.

  • Create a Loadbalancer

  • Create a listener so haproxy will get spawned

  • Check on which lbaas agent the loadbalancer got scheduled

  • Verify that haproxy runs on that node

  • As a follow-up to the previous step, verify that it is currently not running on the other node

  • Kill the hosting lbaas agent process and wait for neutron-server to recognize this and reschedule the loadbalancer:

  • Verify that the loadbalancer was indeed rescheduled to the alive lbaas agent and that haproxy is running:

  • Revive the lbaas agent you killed earlier, verify haproxy is no longer running on that node and that neutron-server recognizes the agent:
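
For reference, here is roughly the sequence of commands behind the steps above. This is only a sketch of my own run: lb1, listener1 and private-subnet are placeholder names from my devstack environment, the CLI assumes a python-neutronclient that carries the LBaaS agent scheduler commands, and you should adapt the process handling to however your deployment runs the agents:

    # 1. Create a loadbalancer (lb1 and private-subnet are placeholders)
    neutron lbaas-loadbalancer-create --name lb1 private-subnet

    # 2. Create a listener so haproxy gets spawned
    neutron lbaas-listener-create --name listener1 --loadbalancer lb1 \
        --protocol HTTP --protocol-port 80

    # 3. Check which lbaasv2 agent the loadbalancer got scheduled to
    neutron lbaas-agent-hosting-loadbalancer lb1

    # 4. On that node, haproxy should be running inside a qlbaas- namespace...
    ps aux | grep haproxy
    ip netns | grep qlbaas

    # 5. ...while the same commands on the other node should come back empty

    # 6. Kill the hosting lbaasv2 agent and wait for neutron-server to notice
    sudo pkill -9 -f neutron-lbaasv2-agent
    neutron agent-list          # the dead agent eventually shows 'xxx' under alive

    # 7. Verify the loadbalancer was rescheduled to the surviving agent and
    #    that haproxy now runs on its node
    neutron lbaas-agent-hosting-loadbalancer lb1
    ps aux | grep haproxy

    # 8. Revive the agent you killed (systemctl / screen, per your setup), then
    #    verify the orphan haproxy is gone and neutron-server sees the agent again
    ps aux | grep haproxy       # should be empty on the revived node
    neutron agent-list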

Deep Dive Into The Code

When you specify ‘lbaasv2’ in service_plugins (/etc/neutron/neutron.conf), neutron-server will load the LoadBalancerPluginv2 class. This mapping is pointed out in setup.cfg. The code flow I’m about to describe runs entirely on the server side.
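
For reference, the two pieces fit together roughly like this; the entry-point line is quoted from memory out of the neutron-lbaas setup.cfg, so treat it as an approximation:

    # /etc/neutron/neutron.conf
    [DEFAULT]
    service_plugins = lbaasv2

    # neutron-lbaas setup.cfg (the entry point that resolves the 'lbaasv2' alias)
    neutron.service_plugins =
        lbaasv2 = neutron_lbaas.services.loadbalancer.plugin:LoadBalancerPluginv2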

If a periodic job is available for the driver (which in our case it is, since we are discussing the LBaaSv2 agent-based driver), LoadBalancerPluginv2 will add it as a worker in neutron-server.
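
I won’t paste the plugin code here, but conceptually the mechanism boils down to something like the following sketch. It is simplified and only illustrative: the real plugin wraps each job in a proper neutron worker, and start_periodic_jobs is a made-up helper name, not part of the actual code:

    from oslo_service import loopingcall

    def start_periodic_jobs(driver, interval=10, initial_delay=None):
        # Ask the driver for its periodic jobs (the list may be empty) and run
        # each one in a fixed-interval loop, much like neutron-server runs its
        # agent status check workers.
        loops = []
        for job in driver.get_periodic_jobs():
            loop = loopingcall.FixedIntervalLoopingCall(job)
            loop.start(interval=interval, initial_delay=initial_delay)
            loops.append(loop)
        return loops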

Now, let’s see exactly what that periodic job we just added is. Looking at get_periodic_jobs() in agent_driver_base.py, you can clearly see why this is not going to work if you skip the Configuration section: at the time of writing this blog post, the allow_automatic_lbaas_agent_failover option defaults to ‘False’.
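
Paraphrased from memory (imports omitted and attribute paths approximate, so don’t treat this as the verbatim source), the gist is that the rescheduling routine is only handed to neutron-server when the operator opted in:

    def get_periodic_jobs(self):
        periodic_jobs = []
        if cfg.CONF.allow_automatic_lbaas_agent_failover:
            # Only when failover is enabled do we give neutron-server the
            # loadbalancer rescheduling routine to run periodically.
            periodic_jobs.append(
                self.plugin.db.reschedule_lbaas_from_down_agents)
        return periodic_jobs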

Looking more closely at reschedule_lbaas_from_down_agents(), also in agent_driver_base.py, we can see it invokes the generic logic of reschedule_resources_from_down_agents() specifically for loadbalancers.
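
From memory it is a thin wrapper along these lines (paraphrased; the constant and exception names are approximations rather than the verbatim source):

    def reschedule_lbaas_from_down_agents(self):
        """Reschedule loadbalancers from down lbaasv2 agents."""
        self.reschedule_resources_from_down_agents(
            agent_type=lb_const.AGENT_TYPE_LOADBALANCERV2,
            get_down_bindings=self.get_down_loadbalancer_bindings,
            agent_id_attr='agent_id',
            resource_id_attr='loadbalancer_id',
            resource_name='loadbalancer',
            reschedule_resource=self.reschedule_loadbalancer,
            rescheduling_failed=exceptions.LoadbalancerReschedulingFailed)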

Let’s pay special attention to the get_down_bindings and reschedule_resource arguments; those are the key to understanding how this works for loadbalancers.

A down binding is how we determine that a loadbalancer should get rescheduled. A down binding exists when an LBaaSv2 agent:

  • Is administratively up, meaning admin_state_up = True.
  • Has one or more loadbalancers currently scheduled to it.
  • Is marked as down, meaning it has not sent a heartbeat for a configurable amount of time. The threshold for this is set here.

The list of loadbalancers to reschedule according to those criteria is queried from the database in get_down_loadbalancer_bindings().
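
Roughly (again paraphrased from memory, with the model and helper names being approximations), the query joins the loadbalancer-to-agent binding table with the agents table and applies exactly the criteria above:

    def get_down_loadbalancer_bindings(self, context, agent_dead_limit):
        # Agents whose last heartbeat is older than the cutoff are considered
        # down; admin_state_up filters out agents an operator disabled on
        # purpose.
        cutoff = self.get_cutoff_time(agent_dead_limit)
        return (context.session.query(LoadbalancerAgentBinding).
                join(agents_db.Agent).
                filter(agents_db.Agent.heartbeat_timestamp < cutoff,
                       agents_db.Agent.admin_state_up).all())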

Examine the rescheduling procedure:
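
Paraphrased from memory (the method names here are approximations, not the verbatim source), it looks roughly like this:

    def reschedule_loadbalancer(self, context, loadbalancer_id):
        # Part 1: drop the binding to the dead agent, so the loadbalancer is
        # no longer considered hosted by it.
        self._unschedule_loadbalancer(context, loadbalancer_id)

        # Part 2: fetch the loadbalancer from the database and push it through
        # the regular create flow, which schedules it to a live LBaaSv2 agent
        # exactly like a newly created loadbalancer.
        loadbalancer = self.plugin.db.get_loadbalancer(context, loadbalancer_id)
        self.loadbalancer.create(context, loadbalancer)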

This procedure consists of two main parts:

  • Unschedule the loadbalancer, i.e. remove its binding to the now-dead agent.

  • Schedule the loadbalancer, which simply fetches the loadbalancer details from the database and reuses the regular loadbalancer create flow. From that point on it is exactly the same flow we use for newly created loadbalancers, which is what schedules it to a live LBaaSv2 agent.

Now that we know exactly how loadbalancer rescheduling works, let’s examine the generic resource rescheduling logic that invokes it.
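
This helper lives on the neutron side, in the agent schedulers DB code if memory serves. Condensed and paraphrased from memory (imports, logging and error-handling details omitted, names approximate), its core is roughly:

    def reschedule_resources_from_down_agents(self, agent_type,
                                              get_down_bindings,
                                              agent_id_attr, resource_id_attr,
                                              resource_name,
                                              reschedule_resource,
                                              rescheduling_failed):
        """Reschedule resources hosted by dead agents whose admin state is up."""
        agent_dead_limit = self.agent_dead_limit_seconds()
        self.wait_down_agents(agent_type, agent_dead_limit)

        context = ncontext.get_admin_context()
        down_bindings = get_down_bindings(context, agent_dead_limit)

        agents_back_online = set()
        for binding in down_bindings:
            agent_id = getattr(binding, agent_id_attr)
            resource_id = getattr(binding, resource_id_attr)

            # The loop may take a while to reach this binding, so re-check the
            # agent's liveness: if it came back online in the meantime, skip it
            # rather than rescheduling a resource away from an alive agent.
            if agent_id in agents_back_online:
                continue
            agent = self._get_agent(context, agent_id)
            if agent.is_active:
                agents_back_online.add(agent_id)
                continue

            try:
                # The actual rescheduling attempt; in our case this calls the
                # loadbalancer rescheduling procedure shown earlier.
                reschedule_resource(context, resource_id)
            except rescheduling_failed:
                LOG.exception("Failed to reschedule %s %s",
                              resource_name, resource_id)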

Let’s break down the mechanics of this:

  • The first part fetches all the down bindings we should iterate on.
  • We cannot know how long the for loop over those bindings takes to reach each specific agent, so before acting on a binding the code re-checks the agent and makes sure the down binding is still relevant, in order to refrain from rescheduling a resource away from an alive agent.
  • Last comes the actual attempt to reschedule the resource, in our case the loadbalancer.

Appendix

Patches involved: