OpenStack upgrade: when the L3 agent went cuckoo

 neutron  rootwrap  upgrade  Sat 22 April 2017

The context

This week we upgraded an OpenStack platform from Liberty to Mitaka for a customer. Small platform, no need to keep the APIs up, no big deal.

The platform has 3 controller/network nodes, and 3 computes. The neutron configuration is quite common: Open vSwitch ML2 mechanism, self-service networks, virtual routers and floating IPs, L3 HA.

At Objectif Libre we use a home-grown ansible playbook to deploy and upgrade OpenStack platforms, and everything went fine. Well, almost.

The problem

After the L3 agent upgrade and restart the routers were still doing their job, but adding new interfaces to them didn't work. We checked the logs: the agent was logging a lot of Python tracebacks, such as this one:

2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '8a776bc6-b2e3-4439-b122-45ce7479b0a8'
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self._process_router_if_compatible(router)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 440, in _process_router_if_compatible
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self._process_updated_router(router)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 454, in _process_updated_router
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     ri.process(self)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 389, in process
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.enable_keepalived()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 123, in enable_keepalived
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.keepalived_manager.spawn()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/keepalived.py", line 401, in spawn
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     keepalived_pm.enable(reload_cfg=True)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 94, in enable
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.reload_cfg()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 97, in reload_cfg
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.disable('HUP')
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 109, in disable
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     utils.execute(cmd, run_as_root=True)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 116, in execute
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     execute_rootwrap_daemon(cmd, process_input, addl_env))
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 102, in execute_rootwrap_daemon
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     return client.execute(cmd, process_input)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 128, in execute
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     try:
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "<string>", line 2, in run_one_command
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     raise convert_to_error(kind, result)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent NoFilterMatched

The trace shows that the problem occurred in oslo.rootwrap, leading to a NoFilterMatched exception. oslo.rootwrap is the library that allows unprivileged applications such as neutron-l3-agent to execute system commands as root. It is based on sudo and provides its own fine-grained authorization mechanism.
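Concretely, the packaged sudoers rules usually boil down to one entry per rootwrap entry point, something like the following (exact paths vary by distribution, so treat this as an illustration):

```
neutron ALL = (root) NOPASSWD: /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf *
neutron ALL = (root) NOPASSWD: /usr/bin/neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
```

All the fine-grained authorization then happens in the .filters files referenced by rootwrap.conf.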

The NoFilterMatched exception is raised by oslo.rootwrap when the requested command doesn't match any of the commands defined in its filter files. Good to know, but which command was being rejected?
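The mechanism behind the exception can be sketched as follows. This is a deliberately simplified illustration, not oslo.rootwrap's actual code: the daemon walks the list of configured filters and raises NoFilterMatched as soon as no filter accepts the requested command.

```python
import os.path


class NoFilterMatched(Exception):
    """Raised when no configured filter authorizes the requested command."""


class CommandFilter(object):
    """Authorizes a command by its executable name (simplified)."""

    def __init__(self, exec_path, run_as):
        self.exec_path = exec_path
        self.run_as = run_as

    def match(self, userargs):
        # Compare only the executable's base name
        return bool(userargs) and (os.path.basename(userargs[0])
                                   == os.path.basename(self.exec_path))


def match_filter(filters, userargs):
    """Return the first filter that accepts the command, else raise."""
    for f in filters:
        if f.match(userargs):
            return f
    raise NoFilterMatched()


filters = [CommandFilter('/usr/sbin/keepalived', 'root')]
match_filter(filters, ['keepalived', '-P'])        # accepted
# match_filter(filters, ['kill', '-HUP', '14410'])  # would raise NoFilterMatched
```

The real library reports the exception back to the caller over the rootwrap daemon's channel, which is why it surfaces in the L3 agent's traceback without naming the command.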

Enabling debug logging in the L3 agent didn't help: the offending command still didn't show up in the logs.
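For reference, this is just the standard debug flag in the agent's configuration file (here /etc/neutron/l3_agent.ini; the option can also be set in neutron.conf):

```ini
[DEFAULT]
debug = True
```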

So we patched oslo.rootwrap to make it a bit more verbose, by modifying /usr/lib/python2.7/site-packages/oslo_rootwrap/client.py:

--- client.py.orig   2017-04-22 08:19:16.463450594 +0200
+++ client.py        2017-04-22 08:21:51.590386941 +0200
@@ -121,6 +121,7 @@
             return self._proxy

     def execute(self, cmd, stdin=None):
+        LOG.info('CMD: %s' % cmd)
         self._ensure_initialized()
         proxy = self._proxy
         retry = False

After the L3 agent restart the logs became a bit more interesting:

2017-04-20 14:57:28.602 10262 INFO oslo_rootwrap.client [req-f6dc5751-96e3-41b8-86bc-f7f98ff26f12 - 3ce2f82bc46b429285ba0e17840e6cf7 - - -] CMD: ['kill', '-HUP', '14410']
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent [req-f6dc5751-96e3-41b8-86bc-f7f98ff26f12 - 3ce2f82bc46b429285ba0e17840e6cf7 - - -] Failed to process compatible router 'eb356f30-98c9-4641-9f99-2ad91a6a7223'
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent     self._process_router_if_compatible(router)
[...]
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 129, in execute
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent     res = proxy.run_one_command(cmd, stdin)
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "<string>", line 2, in run_one_command
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent     raise convert_to_error(kind, result)
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent NoFilterMatched

The L3 agent was trying to send a signal to a process with PID 14410. ps told us more about it:

root     14410  0.0  0.0 111640  1324 ?        Ss   mars01   3:28 keepalived -P [...]

keepalived is used by the L3 agent for the router HA feature: for each HA router, a keepalived (VRRP) process is started to handle the failover in case a node goes down.

So neutron was not authorized to send signals to this process.

The solution

Knowing that the problem was related to a missing authorization in the oslo_rootwrap configuration we did a bit of digging in the configuration files:

$ grep keepalived /usr/share/neutron/rootwrap/*.filters
/usr/share/neutron/rootwrap/l3.filters:keepalived: CommandFilter, keepalived, root
/usr/share/neutron/rootwrap/l3.filters:kill_keepalived: KillFilter, root, /usr/sbin/keepalived, -HUP, -15, -9

The configuration allowed neutron to send signals to /usr/sbin/keepalived processes, but our process was called keepalived, without an absolute path. So we added a new filter to deal with the existing processes:

kill_keepalived_no_path: KillFilter, root, keepalived, -HUP, -15, -9
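Why the packaged filter didn't match can be illustrated with a simplified sketch of the matching behaviour we observed (again, not oslo.rootwrap's actual code: the real KillFilter looks the target process's command up under /proc/<pid>/, while here it is passed in so the sketch stays self-contained):

```python
class KillFilter(object):
    """Very simplified KillFilter: authorizes `kill` calls against
    processes whose command matches the configured one."""

    def __init__(self, command, *signals):
        self.command = command
        self.signals = signals

    def match(self, userargs, proc_command):
        # userargs is e.g. ['kill', '-HUP', '14410']
        if not userargs or userargs[0] != 'kill':
            return False
        if len(userargs) == 3 and userargs[1] not in self.signals:
            return False
        # Literal comparison: '/usr/sbin/keepalived' != 'keepalived'
        return proc_command == self.command


# The packaged filter only matches the full path...
packaged = KillFilter('/usr/sbin/keepalived', '-HUP', '-15', '-9')
print(packaged.match(['kill', '-HUP', '14410'], 'keepalived'))  # False

# ...while the added filter matches the path-less command of the
# pre-upgrade processes:
added = KillFilter('keepalived', '-HUP', '-15', '-9')
print(added.match(['kill', '-HUP', '14410'], 'keepalived'))  # True
```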

After a restart the L3 agent started to act as expected again.

Conclusion

Mitaka is a somewhat old release in OpenStack terms, and we didn't face this problem during upgrades to more recent OpenStack versions.

Knowing how to read Python tracebacks and dig into OpenStack code remains a valuable skill for understanding situations like this one (Google didn't help much).

rootwrap usually does its job quite well, and this problem gave us the opportunity to better understand how it works and how to deal with its configuration.
