Another fun issue with an OpenStack platform this week: a lost Keystone project. This is the story of how we brought this project back to life without losing existing resources.

We have a small OpenStack platform running in our Objectif Libre office in Toulouse, France. We use it internally to run test instances. It's running Ocata, and the Keystone setup uses the domains feature to separate service and temporary accounts (default domain) from LDAP-backed accounts (olcorp domain). The only project in the olcorp domain, lab, holds all our virtual resources.

Luke's problem

My colleague Luke (fictional name) could not log in anymore at some point this week. He received this very explicit message: "You are not authorized for any projects or domains."

Not cool.

He uses OpenStack a lot, knows what he's doing, and his account had not been suspended. I tried with my own account: same error. I tried again with the cloud-admin account this time - stored in the Keystone database, not on the LDAP server. Everything went fine and I could perform requests. One of those requests was:

openstack project list --domain olcorp

Empty answer. No project means no way to create or access resources, even if authentication is valid.

The lab project had disappeared.

Restoring the project

When a project is removed from the Keystone database, the associated resources (instances, volumes, networks, ...) are not destroyed. This might look like a maintenance problem, but in our case it proved quite useful.
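
As cloud-admin we could for instance still list the instances attached to the now-nonexistent project (a sketch; $OS_PROJECT_ID holds the old project ID, taken from our openrc file as shown further down):

$ openstack server list --all-projects --project $OS_PROJECT_ID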

I hoped that Keystone used soft-deletion of database resources (the data would still be there, but marked as deleted), but no luck there.
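
A quick look at the schema confirms it: unlike Nova's tables, Keystone's project table has no deleted column:

$ mysql keystonedb -e "show columns from project"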

The revival of the project required a few steps:

  1. Creation of a new lab project. This is a start but not enough: the ID of the new project doesn't match the ID of the removed one. All OpenStack resources are associated with a project using its ID, so we needed the same ID. It is not possible to change or define the project ID using the API (AFAIK).

  2. A bit of MySQL tweaking. I try to avoid modifying resources directly on the SQL server as much as I can, but it can be very handy:

    . openrc.sh  # source the OpenStack env file to get the old project ID
    mysql keystonedb -e "update project set id='$OS_PROJECT_ID' where name='lab'"
    
  3. Setup of the roles for users. We use LDAP group-based authorization, with only 2 roles (admin and _member_), so restoring the permissions was easy; see the sketch after this list. It might have been more painful with more roles, groups or users.
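
Restoring our group-based permissions looked something like this (a sketch with hypothetical group names; adapt them to your LDAP groups):

$ openstack role add --group ldap-admins --group-domain olcorp --project lab --project-domain olcorp admin
$ openstack role add --group ldap-users --group-domain olcorp --project lab --project-domain olcorp _member_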

The whole process was straightforward and restoring the project took very little time.

We still don't know what happened on the platform or why the project disappeared, but the keystone access log is quite clear:

10.78.1.21 - - [28/Apr/2017:22:24:20 +0200] "DELETE /v3/projects/68a93cc709b44de08cfd11e6bdac2b9b HTTP/1.1" 204 281 "-" "python-keystoneclient"

It could be human error or a bug (which seems unlikely, but eh). It will be worth a new blog post if we ever find out :)


The context

This week we upgraded an OpenStack platform from Liberty to Mitaka for a customer. Small platform, no need to keep the APIs up, no big deal.

The platform has 3 controller/network nodes, and 3 computes. The neutron configuration is quite common: Open vSwitch ML2 mechanism, self-service networks, virtual routers and floating IPs, L3 HA.

At Objectif Libre we use a home-grown Ansible playbook to deploy and upgrade OpenStack platforms, and everything went fine. Well, almost.

The problem

After the L3 agent upgrade and restart, the routers were still doing their job, but adding new interfaces to them didn't work. We checked the logs: the agent was logging a lot of Python tracebacks, such as this one:

2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '8a776bc6-b2e3-4439-b122-45ce7479b0a8'
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self._process_router_if_compatible(router)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 440, in _process_router_if_compatible
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self._process_updated_router(router)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 454, in _process_updated_router
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     ri.process(self)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 389, in process
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.enable_keepalived()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 123, in enable_keepalived
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.keepalived_manager.spawn()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/keepalived.py", line 401, in spawn
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     keepalived_pm.enable(reload_cfg=True)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 94, in enable
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.reload_cfg()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 97, in reload_cfg
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.disable('HUP')
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 109, in disable
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     utils.execute(cmd, run_as_root=True)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 116, in execute
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     execute_rootwrap_daemon(cmd, process_input, addl_env))
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 102, in execute_rootwrap_daemon
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     return client.execute(cmd, process_input)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 128, in execute
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     try:
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "<string>", line 2, in run_one_command
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     raise convert_to_error(kind, result)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent NoFilterMatched

The trace shows that a problem occurred in oslo_rootwrap, leading to a NoFilterMatched exception. oslo.rootwrap is the library that allows unprivileged applications such as neutron-l3-agent to execute system commands as root. It is based on sudo and provides its own authorization mechanism.
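
The wiring typically looks like this: sudo allows the agent to run the rootwrap command as root, and rootwrap then checks every requested command against its filter files. A typical sudoers entry (a sketch; paths may vary between distributions):

neutron ALL = (root) NOPASSWD: /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf *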

The NoFilterMatched exception is raised by oslo_rootwrap when the requested command doesn't match any of those defined in the configuration. This is great, but which command?

Activating debug logging in the L3 agent didn't help: the problematic command still didn't show up in the logs.

So we patched oslo.rootwrap to make it a bit more verbose. We modified /usr/lib/python2.7/site-packages/oslo_rootwrap/client.py:

--- client.py.orig   2017-04-22 08:19:16.463450594 +0200
+++ client.py        2017-04-22 08:21:51.590386941 +0200
@@ -121,6 +121,7 @@
             return self._proxy

     def execute(self, cmd, stdin=None):
+        LOG.info('CMD: %s' % cmd)
         self._ensure_initialized()
         proxy = self._proxy
         retry = False

After the L3 agent restart the logs became a bit more interesting:

2017-04-20 14:57:28.602 10262 INFO oslo_rootwrap.client [req-f6dc5751-96e3-41b8-86bc-f7f98ff26f12 - 3ce2f82bc46b429285ba0e17840e6cf7 - - -] CMD: ['kill', '-HUP', '14410']
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent [req-f6dc5751-96e3-41b8-86bc-f7f98ff26f12 - 3ce2f82bc46b429285ba0e17840e6cf7 - - -] Failed to process compatible router 'eb356f30-98c9-4641-9f99-2ad91a6a7223'
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent     self._process_router_if_compatible(router)
[...]
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 129, in execute
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent     res = proxy.run_one_command(cmd, stdin)
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "<string>", line 2, in run_one_command
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent     raise convert_to_error(kind, result)
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent NoFilterMatched

The L3 agent was trying to send a signal to a process with PID 14410. ps told us more about it:

root     14410  0.0  0.0 111640  1324 ?        Ss   mars01   3:28 keepalived -P [...]

keepalived is used by the L3 agent for the router HA feature. For each router a VRRP/keepalived process is started to handle the failover in case a node goes down.
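
Each keepalived process can be matched to its router through neutron's state directory (a sketch, assuming the default state path; the router ID is the one from the logs above):

$ ls /var/lib/neutron/ha_confs/eb356f30-98c9-4641-9f99-2ad91a6a7223/keepalived.conf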

So neutron was not authorized to send signals to this process.

The solution

Knowing that the problem was related to a missing authorization in the oslo_rootwrap configuration, we did a bit of digging in the configuration files:

$ grep keepalived /usr/share/neutron/rootwrap/*.filters
/usr/share/neutron/rootwrap/l3.filters:keepalived: CommandFilter, keepalived, root
/usr/share/neutron/rootwrap/l3.filters:kill_keepalived: KillFilter, root, /usr/sbin/keepalived, -HUP, -15, -9

The configuration allowed neutron to send signals to /usr/sbin/keepalived processes, but our process was called keepalived, without an absolute path. So we added a new filter to deal with the existing processes:

kill_keepalived_no_path: KillFilter, root, keepalived, -HUP, -15, -9
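
A minimal way to roll the fix out on each network node (a sketch, assuming the filter path shown above):

$ echo 'kill_keepalived_no_path: KillFilter, root, keepalived, -HUP, -15, -9' >> /usr/share/neutron/rootwrap/l3.filters
$ systemctl restart neutron-l3-agent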

After a restart, the L3 agent behaved as expected again.

Conclusion

Mitaka is a somewhat old release in OpenStack terms, and we didn't face this problem during upgrades to more recent OpenStack versions.

Knowing how to read Python tracebacks and how to dig into OpenStack code remains a valuable skill for understanding situations like this one (Google didn't help much).

rootwrap usually does its job quite well and this problem gave us the opportunity to better understand how it works and how to deal with its configuration.


Note

This post assumes that you have already set up a Ceph RadosGW with Keystone authentication.

The Keystone admin token is the old, insecure and deprecated method to authenticate against an OpenStack Identity server. It was used to bootstrap the creation of OpenStack users and projects, and a good practice was to disable this feature completely to avoid bad security surprises.
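
As a reminder, token-based bootstrap looked something like this (a sketch; the token value is the admin_token option from keystone.conf):

$ export OS_TOKEN=<value of admin_token in keystone.conf>
$ export OS_URL=http://keystone.host:35357/v3
$ export OS_IDENTITY_API_VERSION=3
$ openstack user list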

But the Ceph RadosGW documentation for the stable releases - jewel as of this writing - clearly states that you need to use this admin token, and that there's no other way to connect with Keystone.

Well, that's not true.

Authentication using a service account has actually been supported for quite a while, but was never documented. Keystone v3 has also been supported since the jewel release. The master docs have nice updates.

For keystone v3 you can use something like this in your ceph.conf:

[client.rgw.HOSTNAME]
rgw keystone url = http://keystone.host:35357
rgw keystone admin user = ceph
rgw keystone admin password = S3Cr3t
rgw keystone admin project = admin
rgw keystone admin domain = default
rgw keystone api version = 3
...

You need to create a ceph service account and give it the admin role:

$ openstack user create ceph --password-prompt
$ openstack role add --user ceph --project admin admin
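
You can check that the account works by requesting a token with it (a sketch; adjust the auth URL and domain names to your setup):

$ openstack --os-auth-url http://keystone.host:35357/v3 --os-identity-api-version 3 \
    --os-username ceph --os-password S3Cr3t --os-project-name admin \
    --os-user-domain-name default --os-project-domain-name default token issue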

Don't forget to remove the admin_token_auth filter from your paste-deploy pipelines in /etc/keystone/keystone-paste.ini.
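
One way to do it (a sketch; review your pipelines rather than editing blindly, and restart the Keystone service afterwards):

$ sed -i 's/ admin_token_auth//g' /etc/keystone/keystone-paste.ini
$ systemctl restart httpd  # or however Keystone is served on your platform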