The context
This week we upgraded an OpenStack platform from Liberty to Mitaka for a customer. Small platform, no need to keep the APIs up, no big deal.
The platform has 3 controller/network nodes, and 3 computes. The neutron configuration is quite common: Open vSwitch ML2 mechanism, self-service networks, virtual routers and floating IPs, L3 HA.
At Objectif Libre we use a home-grown Ansible playbook to deploy and upgrade OpenStack platforms, and everything went fine. Well, almost.
The problem
After the L3 agent upgrade and restart, the routers were still doing their job, but adding new interfaces to them didn't work. We checked the logs: the agent was logging a lot of Python tracebacks, such as this one:
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '8a776bc6-b2e3-4439-b122-45ce7479b0a8'
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent self._process_router_if_compatible(router)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 440, in _process_router_if_compatible
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent self._process_updated_router(router)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 454, in _process_updated_router
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent ri.process(self)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 389, in process
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent self.enable_keepalived()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 123, in enable_keepalived
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent self.keepalived_manager.spawn()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/keepalived.py", line 401, in spawn
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent keepalived_pm.enable(reload_cfg=True)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 94, in enable
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent self.reload_cfg()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 97, in reload_cfg
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent self.disable('HUP')
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 109, in disable
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent utils.execute(cmd, run_as_root=True)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 116, in execute
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent execute_rootwrap_daemon(cmd, process_input, addl_env))
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 102, in execute_rootwrap_daemon
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent return client.execute(cmd, process_input)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 128, in execute
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent try:
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "<string>", line 2, in run_one_command
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent raise convert_to_error(kind, result)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent NoFilterMatched
The trace shows that a problem occurred in oslo_rootwrap, leading to a NoFilterMatched exception. oslo.rootwrap is the library that allows unprivileged applications such as neutron-l3-agent to execute system commands as root. It is based on sudo and provides its own authorization mechanism.
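As a reminder of how the pieces fit together: the agent hands its privileged commands to a root helper defined in its configuration, and on this platform it goes through the rootwrap daemon (hence the execute_rootwrap_daemon call in the trace). The configuration typically looks roughly like this; section name and paths can vary between distributions, so treat it as indicative:

[agent]
root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf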
The NoFilterMatched exception is raised by oslo_rootwrap when the requested command doesn't match any of the filters defined in its configuration. Good to know, but which command?
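To illustrate the principle, here is a minimal sketch of our own (not the actual oslo.rootwrap code): every requested command is checked against the configured filters, and NoFilterMatched is raised when none of them accepts it.

# Minimal illustration of the rootwrap idea; the real filters are richer
# (CommandFilter, KillFilter, ...), this only shows the matching principle.
class NoFilterMatched(Exception):
    pass


class ExactCommandFilter(object):
    """Hypothetical filter accepting a single, exact executable name."""

    def __init__(self, exe):
        self.exe = exe

    def match(self, userargs):
        return bool(userargs) and userargs[0] == self.exe


def match_filter(filters, userargs):
    """Return the first filter that accepts the command, or raise."""
    for f in filters:
        if f.match(userargs):
            return f
    raise NoFilterMatched()


try:
    match_filter([ExactCommandFilter('/usr/sbin/keepalived')],
                 ['kill', '-HUP', '14410'])
except NoFilterMatched:
    print('command rejected')  # this is what the L3 agent ran into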
Activating debug logging in the L3 agent didn't help: the problematic command still didn't show up in the logs.
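For reference, activating debug simply means setting the standard oslo.log option in the agent configuration (l3_agent.ini or neutron.conf, depending on the deployment):

[DEFAULT]
debug = True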
So we patched oslo.rootwrap to make it a bit more verbose, by modifying /usr/lib/python2.7/site-packages/oslo_rootwrap/client.py:
--- client.py.orig 2017-04-22 08:19:16.463450594 +0200
+++ client.py 2017-04-22 08:21:51.590386941 +0200
@@ -121,6 +121,7 @@
         return self._proxy
 
     def execute(self, cmd, stdin=None):
+        LOG.info('CMD: %s' % cmd)
         self._ensure_initialized()
         proxy = self._proxy
         retry = False
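The change is only picked up after an agent restart; on this platform (an RDO/CentOS-style install, judging by the paths) that means something like:

# systemctl restart neutron-l3-agent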
After the restart, the logs became a bit more interesting:
2017-04-20 14:57:28.602 10262 INFO oslo_rootwrap.client [req-f6dc5751-96e3-41b8-86bc-f7f98ff26f12 - 3ce2f82bc46b429285ba0e17840e6cf7 - - -] CMD: ['kill', '-HUP', '14410']
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent [req-f6dc5751-96e3-41b8-86bc-f7f98ff26f12 - 3ce2f82bc46b429285ba0e17840e6cf7 - - -] Failed to process compatible router 'eb356f30-98c9-4641-9f99-2ad91a6a7223'
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent self._process_router_if_compatible(router)
[...]
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 129, in execute
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent res = proxy.run_one_command(cmd, stdin)
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent File "<string>", line 2, in run_one_command
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent raise convert_to_error(kind, result)
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent NoFilterMatched
The L3 agent was trying to send a SIGHUP signal to the process with PID 14410. ps told us more about it:
root 14410 0.0 0.0 111640 1324 ? Ss mars01 3:28 keepalived -P [...]
keepalived is used by the L3 agent for the router HA feature: for each HA router, a keepalived (VRRP) process is started to handle the failover in case a node goes down.
So neutron was not authorized to send signals to this process.
The solution
Knowing that the problem was related to a missing authorization in the oslo_rootwrap configuration, we did a bit of digging in the configuration files:
$ grep keepalived /usr/share/neutron/rootwrap/*.filters
/usr/share/neutron/rootwrap/l3.filters:keepalived: CommandFilter, keepalived, root
/usr/share/neutron/rootwrap/l3.filters:kill_keepalived: KillFilter, root, /usr/sbin/keepalived, -HUP, -15, -9
The configuration allowed neutron to send signals to /usr/sbin/keepalived processes, but our running process had been started as plain keepalived, without an absolute path. So we added a new filter to deal with these existing processes:
kill_keepalived_no_path: KillFilter, root, keepalived, -HUP, -15, -9
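For the record, the keepalived-related entries in /usr/share/neutron/rootwrap/l3.filters then looked like this, the last line being our addition:

keepalived: CommandFilter, keepalived, root
kill_keepalived: KillFilter, root, /usr/sbin/keepalived, -HUP, -15, -9
kill_keepalived_no_path: KillFilter, root, keepalived, -HUP, -15, -9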
After a restart, the L3 agent started to behave as expected again.
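To make the mismatch concrete, here is a tiny sketch of the comparison (our own illustration, not the actual KillFilter code): a filter expressed with the full executable path rejects a process whose recorded command is the bare name, which is exactly what the extra entry works around.

def kill_allowed(configured_exe, process_command):
    # Sketch: compare what the filter declares with what the process reports.
    return process_command == configured_exe


print(kill_allowed('/usr/sbin/keepalived', 'keepalived'))  # False -> NoFilterMatched
print(kill_allowed('keepalived', 'keepalived'))            # True with the new filter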
Conclusion
Mitaka is a somewhat old release in OpenStack terms, and we didn't face this problem during upgrades to more recent OpenStack versions.
Knowing how to read Python tracebacks and how to dig into OpenStack code is still a valuable skill for understanding situations like this one (Google didn't help much).
rootwrap usually does its job quite well, and this problem gave us the opportunity to better understand how it works and how to deal with its configuration.