I needed to deploy an OpenShift Origin instance for testing purposes. This article describes how I used openshift-ansible to deploy the software.

Existing tools

Several tools already exist to do this. They work fine but provide a limited set of features by default.

Environment

I used an x86 physical server for the deployment:

  • 8 cores
  • 32 GB of RAM
  • 2 × 1 TB disks

OpenShift and the ansible playbook only support Red Hat-like distributions. I used a minimal CentOS 7.4 installation, with SELinux and firewalld both disabled.
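
For reference, this is roughly how I turn both off on a fresh CentOS 7 install (fine for a test machine, not a production recommendation):

# systemctl stop firewalld
# systemctl disable firewalld
# setenforce 0  # turn SELinux off for the running system
# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config  # persist after reboot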

The machine DNS name is op1.pocentek.net.

Docker setup

The OpenShift playbook requires a working docker-engine installation on the target host. For better performance OpenShift recommends using the overlay2 storage driver, which requires an XFS filesystem.

Docker installation steps:

# mkfs.xfs /dev/sdb1  # dedicated disk for docker in this setup
# mkdir /var/lib/docker
# echo '/dev/sdb1 /var/lib/docker xfs defaults 0 0' >> /etc/fstab
# mount -a
# yum install -y docker
# echo '{"storage-driver": "overlay2"}' > /etc/docker/daemon.json
# systemctl enable docker.service
# systemctl start docker.service
# docker ps  # make sure you can talk to the docker daemon
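
You can check that the daemon actually picked up the overlay2 driver:

# docker info | grep -i 'storage driver'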

DNS setup

To benefit from the OpenShift routing feature I defined a wildcard A record in the pocentek.net DNS zone:

*.oc.pocentek.net. IN A 12.34.56.78

This allows dynamic resolution for all the applications deployed on OpenShift, as long as they are routed using a matching domain name.
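
Any name under the wildcard should resolve to the server, which is easy to verify (myapp is just an example name):

$ dig +short myapp.oc.pocentek.net
12.34.56.78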

Playbook configuration

The OpenShift playbook requires only a few variables to be set to perform the installation, but a single-node setup requires a few tweaks.

You first need to retrieve the code. I used the 3.6 version of OpenShift in this example:

$ git clone https://github.com/openshift/openshift-ansible.git
$ cd openshift-ansible
$ git checkout --track origin/release-3.6

All the settings are defined in an inventory file. I used the following inventory/hosts file:

[OSEv3:children]
masters
nodes
etcd

[masters]
op1.pocentek.net openshift_public_hostname="{{ inventory_hostname }}" openshift_hostname="{{ ansible_default_ipv4.address }}"

[etcd]
op1.pocentek.net

[nodes]
op1.pocentek.net openshift_node_labels="{'region': 'primary', 'zone': 'default'}" openshift_schedulable=true

[OSEv3:vars]
ansible_ssh_user=root
ansible_become=no

openshift_deployment_type=origin
openshift_release=v3.6

openshift_master_default_subdomain=oc.pocentek.net

openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]
openshift_master_htpasswd_users={'gpocentek': 'some_htpasswd_encrypted_passwd'}

openshift_hosted_router_replicas=1
openshift_hosted_registry_replicas=1

openshift_router_selector='region=primary'
openshift_registry_selector='region=primary'

Some variables require a bit of explanation:

openshift_schedulable=true
By default a master node will be configured to be ignored by the OpenShift scheduler. Application containers will not be created on masters. Since we only have one node, the master should be configured to host application containers.
openshift_router_selector and openshift_registry_selector

Routers (which expose services to the outside world) and the docker registry both run as containers on one or several nodes of the OpenShift cluster. By default they run on infrastructure nodes: dedicated nodes hosting internal services. To make sure that these services are properly scheduled and started on the single-node deployment, we explicitly label the node (region: primary) and configure the router and registry selectors to match this node.

We also make sure that only 1 container is scheduled for each service (openshift_hosted_{router,registry}_replicas).

openshift_master_htpasswd_users
In this setup htpasswd authentication is used, and a gpocentek user is created by the playbook. You can generate the encrypted password using the htpasswd tool.
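
For example (htpasswd is provided by the httpd-tools package on CentOS; the password below is illustrative):

$ htpasswd -nb gpocentek s3cret

Copy the hash part of the output as the user's value in openshift_master_htpasswd_users.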

The node can be deployed using:

$ ansible-playbook -i inventory/hosts playbooks/byo/config.yml

Note: you can find sample inventories in inventory/byo/.
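
Once the playbook completes, a quick smoke test can confirm that the cluster answers. Run it as root on the master, which holds the admin kubeconfig (adjust the URL to your setup):

# oc get nodes
# oc login https://op1.pocentek.net:8443 -u gpocentek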

Storage

One feature I couldn't manage to deploy is persistent storage support. Since the deployment isn't meant for production, I used an NFS server deployed on the OpenShift machine to provide PVs:

for i in {1..9}; do
    mkdir -p /exports/volumes/vol0$i
    chown nfsnobody:nfsnobody /exports/volumes/vol0$i
    chmod 775 /exports/volumes/vol0$i
    echo "/exports/volumes/vol0$i *(rw,root_squash,no_wdelay)" >> /etc/exports

    oc create -f - << EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0$i
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  nfs:
    path: /exports/volumes/vol0$i
    server: 172.17.0.1
  persistentVolumeReclaimPolicy: Recycle
EOF
done
exportfs -r  # publish the new exports

Containers using PVCs bound to these PVs must define a custom securityContext:

securityContext:
  supplementalGroups: [65534]
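
To illustrate, a throwaway pod consuming one of these volumes could look like this (a sketch: the nfs-test pod and the my-claim PVC are hypothetical names):

oc create -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: nfs-test
spec:
  securityContext:
    supplementalGroups: [65534]
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-claim  # hypothetical PVC bound to one of the PVs above
EOF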

Reference: https://docs.openshift.org/latest/install_config/persistent_storage/persistent_storage_nfs.html#nfs-supplemental-groups


Another fun issue with an OpenStack platform this week: a lost Keystone project. This is the story of how we brought this project back to life without losing existing resources.

We have a small OpenStack platform running in our Objectif Libre office in Toulouse, France. We use it internally to run test instances. It's running Ocata, and the Keystone setup uses the domains feature to separate service and temporary accounts (default domain) from LDAP-backed accounts (olcorp domain). The only project in the olcorp domain, lab, holds all our virtual resources.

Luke's problem

My colleague Luke (fictional name) could not log in anymore at some point this week. He received this very explicit message: "You are not authorized for any projects or domains."

Not cool.

He uses OpenStack a lot, knows what he's doing, and his account had not been suspended. I tried with my own account: same error. I tried again with the cloud-admin account this time - stored in the Keystone database, not on the LDAP server. Everything went fine, I could perform requests. One of those requests was:

openstack project list --domain olcorp

Empty answer. No project means no way to create or access resources, even if authentication is valid.

The lab project had disappeared.

Restoring the project

When a project is removed from the Keystone database, the associated resources (instances, volumes, networks, ...) are not destroyed. This might look like a maintenance problem, but in our case it turned out to be quite useful.

I hoped that Keystone used soft-deletion of database resources (the data would still be there, but marked as deleted), but no luck there.

The revival of the project required a few steps:

  1. Creation of a new lab project. This is a start but is not enough: the ID of the new project doesn't match the ID of the removed one. All the OpenStack resources are associated with a project using its ID, so we needed the same ID. It is not possible to change or define the project ID using the API (AFAIK).

  2. A bit of MySQL tweaking. I try to avoid modifying resources on the SQL server as much as I can, but it can be very handy:

    . openrc.sh  # source the OpenStack env file to get the old project ID
    mysql keystonedb -e "update project set id='$OS_PROJECT_ID' where name='lab'"
    
  3. Setup of the roles for users (see the sketch after this list). We use LDAP group-based authorization, with only 2 roles (admin and _member_), so restoring the permissions was easy. It might have been more painful with more roles, groups or users.
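
Put together, steps 1 and 3 looked roughly like this (<ldap-group> is a placeholder for our actual LDAP group names):

# step 1: recreate the project (it gets a new random ID at this point)
openstack project create lab --domain olcorp
# step 3: restore the role assignments for the LDAP groups
openstack role add --project lab --group <ldap-group> --group-domain olcorp _member_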

The whole process was very easy and restoring the project took very little time.

We still don't know what happened on the platform, and why the project disappeared, but the keystone access log is quite clear:

10.78.1.21 - - [28/Apr/2017:22:24:20 +0200] "DELETE /v3/projects/68a93cc709b44de08cfd11e6bdac2b9b HTTP/1.1" 204 281 "-" "python-keystoneclient"

Could be a human error or a bug (seems unlikely but eh). Will be worth a new blog post if we ever find out :)


The context

This week we upgraded an OpenStack platform from Liberty to Mitaka for a customer. Small platform, no need to keep the APIs up, no big deal.

The platform has 3 controller/network nodes, and 3 computes. The neutron configuration is quite common: Open vSwitch ML2 mechanism, self-service networks, virtual routers and floating IPs, L3 HA.

At Objectif Libre we use a home-grown ansible playbook to deploy and upgrade OpenStack platforms, and everything went fine. Well, almost.

The problem

After the L3 agent upgrade and restart the routers were still doing their job, but adding new interfaces to them didn't work. We checked the logs. The agent was logging a lot of python traces, such as this one:

2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent [-] Failed to process compatible router '8a776bc6-b2e3-4439-b122-45ce7479b0a8'
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self._process_router_if_compatible(router)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 440, in _process_router_if_compatible
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self._process_updated_router(router)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 454, in _process_updated_router
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     ri.process(self)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 389, in process
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.enable_keepalived()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/ha_router.py", line 123, in enable_keepalived
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.keepalived_manager.spawn()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/keepalived.py", line 401, in spawn
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     keepalived_pm.enable(reload_cfg=True)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 94, in enable
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.reload_cfg()
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 97, in reload_cfg
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     self.disable('HUP')
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/external_process.py", line 109, in disable
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     utils.execute(cmd, run_as_root=True)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 116, in execute
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     execute_rootwrap_daemon(cmd, process_input, addl_env))
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py", line 102, in execute_rootwrap_daemon
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     return client.execute(cmd, process_input)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 128, in execute
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     try:
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "<string>", line 2, in run_one_command
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent     raise convert_to_error(kind, result)
2017-04-20 14:54:50.371 29021 ERROR neutron.agent.l3.agent NoFilterMatched

The trace shows that a problem occurred in oslo_rootwrap, leading to a NoFilterMatched exception. oslo.rootwrap is the library that allows unprivileged applications such as neutron-l3-agent to execute system commands as root. It is based on sudo and provides its own authorization mechanism.

The NoFilterMatched exception is raised by oslo_rootwrap when the requested command doesn't match any of those defined in the configuration. This is great, but which command?

Activating debug logging in the L3 agent didn't help: the problematic command still didn't show up in the logs.

So we patched oslo to make it a bit more verbose. We modified /usr/lib/python2.7/site-packages/oslo_rootwrap/client.py:

--- client.py.orig   2017-04-22 08:19:16.463450594 +0200
+++ client.py        2017-04-22 08:21:51.590386941 +0200
@@ -121,6 +121,7 @@
             return self._proxy

     def execute(self, cmd, stdin=None):
+        LOG.info('CMD: %s' % cmd)
         self._ensure_initialized()
         proxy = self._proxy
         retry = False

After the L3 agent restart the logs became a bit more interesting:

2017-04-20 14:57:28.602 10262 INFO oslo_rootwrap.client [req-f6dc5751-96e3-41b8-86bc-f7f98ff26f12 - 3ce2f82bc46b429285ba0e17840e6cf7 - - -] CMD: ['kill', '-HUP', '14410']
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent [req-f6dc5751-96e3-41b8-86bc-f7f98ff26f12 - 3ce2f82bc46b429285ba0e17840e6cf7 - - -] Failed to process compatible router 'eb356f30-98c9-4641-9f99-2ad91a6a7223'
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 501, in _process_router_update
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent     self._process_router_if_compatible(router)
[...]
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "/usr/lib/python2.7/site-packages/oslo_rootwrap/client.py", line 129, in execute
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent     res = proxy.run_one_command(cmd, stdin)
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "<string>", line 2, in run_one_command
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent   File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent     raise convert_to_error(kind, result)
2017-04-20 14:57:28.604 10262 ERROR neutron.agent.l3.agent NoFilterMatched

The L3 agent was trying to send a signal to a process with PID 14410. ps told us more about it:

root     14410  0.0  0.0 111640  1324 ?        Ss   mars01   3:28 keepalived -P [...]

keepalived is used by the L3 agent for the router HA feature. For each router a VRRP/keepalived process is started to handle the failover in case a node goes down.

So neutron was not authorized to send signals to this process.

The solution

Knowing that the problem was related to a missing authorization in the oslo_rootwrap configuration we did a bit of digging in the configuration files:

$ grep keepalived /usr/share/neutron/rootwrap/*.filters
/usr/share/neutron/rootwrap/l3.filters:keepalived: CommandFilter, keepalived, root
/usr/share/neutron/rootwrap/l3.filters:kill_keepalived: KillFilter, root, /usr/sbin/keepalived, -HUP, -15, -9

The configuration allowed neutron to send signals to /usr/sbin/keepalived processes, but our process was called keepalived, without an absolute path. So we added a new configuration entry to deal with the existing processes:

kill_keepalived_no_path: KillFilter, root, keepalived, -HUP, -15, -9
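
Applied on each network node, the fix looked roughly like this (paths match the CentOS packages shown above):

echo 'kill_keepalived_no_path: KillFilter, root, keepalived, -HUP, -15, -9' \
    >> /usr/share/neutron/rootwrap/l3.filters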

After a restart the L3 agent behaved as expected again.

Conclusion

Mitaka is a somewhat old release in OpenStack terms, and we didn't face this problem during upgrades to more recent OpenStack versions.

Knowing how to read python traces and how to dig into the OpenStack code remains a useful skill for understanding situations like this one (Google didn't help much here).

rootwrap usually does its job quite well and this problem gave us the opportunity to better understand how it works and how to deal with its configuration.


Ansible playbooks often contain sensitive information that needs to be kept private: passwords, private keys, DNS transfer keys and so on. This becomes a real problem when you have to share the playbooks and their sensitive data with coworkers in a git repository.

To solve this problem ansible provides the ansible-vault tool. It encrypts files using a password:

$ ansible-vault create group_vars/host
New Vault password:
Confirm New Vault password:
EDIT EDIT EDIT
$ ansible-vault edit group_vars/host
Vault password:
UPDATE UPDATE UPDATE
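
ansible-vault can also encrypt an existing file in place, and change its password later:

$ ansible-vault encrypt group_vars/host
$ ansible-vault rekey group_vars/host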

What you commit in your git repository is something that looks like this (only longer):

ANSIBLE_VAULT;1.1;AES256
6661656265653234313962356465316166383...

You then need to use the --ask-vault-pass or --vault-password-file options to unlock the encrypted file when you run your playbook. Nothing complicated, but:

  • what happens if you don't manually run ansible, but instead use an orchestration tool like Jenkins or Ansible Tower?
  • how do you share and store the password with your coworkers in a secure manner?

What to do?

A solution is to use an external tool to store and retrieve the password, for instance pass or HashiCorp Vault.

To do this you need to use a script instead of a file with the --vault-password-file option. You also need to tell ansible to always use this script:

  1. Write a script in a vault_pass file. This script should print the ansible-vault password on the standard output:

    #!/bin/sh

    # using pass
    pass pocentek.net/ansible/vault

    # or using vault (keep only one of the two commands in the real script)
    vault read -field=password secret/pocentek.net/ansible_vault
    
  2. Make the script executable:

    $ chmod +x vault_pass
    
  3. Add the following in your ansible.cfg file:

    [defaults]
    vault_password_file = ./vault_pass
    
  4. Run your playbook:

    ansible-playbook your-playbook.yml
    

Pass or Vault as external tool?

pass is really easy to set up and is my tool of choice for personal projects. When working with several people it becomes more complicated to use:

  • every user must store the shared password at a predefined path on their local machine
  • if the password must be changed every user must update it locally

vault is more complex to set up but offers some nice advantages:

  • no need for everyone to store the password locally
  • vault supports ACLs. If a user leaves the project, her permissions are revoked and the password updated only once on the vault server
  • password changes are easier to handle and can be done more often
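
For reference, storing the password in each tool could look like this (the paths match the vault_pass script above; the Vault command assumes the generic secret backend mounted at secret/):

$ pass insert pocentek.net/ansible/vault
$ vault write secret/pocentek.net/ansible_vault password='s3cret'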

I use LXC on my Ubuntu workstation quite often. LXD has been out for a while, and I tested it to see if I could use it as a direct replacement for LXC. And the answer is yes! LXD provides nice management tools that didn't exist in LXC, but the mechanics are the same.

This blog post is a recap of what I did to set up a local installation. It assumes you already know what LXC is and how to use it.

Some differences with LXC

  • No more template scripts: LXD uses pre-built images. This has become quite common (think Docker/EC2/OpenStack Glance).
  • LXD runs as a daemon and can be managed remotely. If run locally, any user in the lxd group can talk to the daemon. APIs are great.
  • Network management is way simpler and doesn't require tweaking configuration files (see the example below).
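
For example, networks are plain API objects managed with the lxc client, no configuration file editing required (a sketch: the bridge name and subnet are illustrative, and the lxc network subcommand needs LXD 2.3 or later):

$ lxc network list
$ lxc network create lxdbr1 ipv4.address=10.0.5.1/24 ipv4.nat=true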

Install and configure LXD

Ubuntu 16.04 seems to come with LXD installed, but in case it isn't there:

sudo apt install lxd

You can then use the lxd init tool to set up the initial configuration:

sudo lxd init

You will have to answer questions about:

  • The storage back-end: directory or zfs. The zfs back-end is nice: it uses clones and snapshots to optimize performance when creating containers, and consumes less disk space.
  • The initial network.
  • The LXD API access: local only or exposed on a network.

The lxd command manages the daemon; use the lxc command to manage your containers.

Create and access containers

Container creation is straightforward:

lxc launch ubuntu:16.04 c1

ubuntu:16.04 is the reference to an existing container image. If LXD cannot find it locally, it will download it from a remote repository (Canonical's by default). The image is then stored locally.

The container will be started after creation. Use the list or info subcommands to get information about the new container.
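
For example:

$ lxc list
$ lxc info c1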

You will not be able to access the container using SSH by default:

$ ssh ubuntu@10.0.4.242
Permission denied (publickey).

Just like for Ubuntu cloud instances, the default user doesn't have a password set, and you need to use an SSH key to authenticate. Some initial setup is required. Not handy, but it only needs to be done once.

To configure your SSH key inside the container use the exec subcommand:

$ lxc exec c1 /bin/bash
root@c1:~# mkdir -p /home/ubuntu/.ssh
root@c1:~# echo "YOUR PUBLIC KEY" > /home/ubuntu/.ssh/authorized_keys
root@c1:~# chown -R ubuntu:ubuntu /home/ubuntu/.ssh
root@c1:~# exit
exit

Validate that you can access the container:

$ ssh ubuntu@10.0.4.242
...
ubuntu@c1:~$

Congrats!

Now you can build a new image that contains your SSH key:

$ lxc stop c1
$ lxc publish c1 --alias ubuntu-ssh
$ lxc image list | grep ubuntu-ssh
$ lxc launch ubuntu-ssh c2

What's next

Stéphane Graber's blog contains a lot of very interesting articles about LXC/LXD.

You can set up DNS resolution in the same way you might have done for LXC.

The next step for me will be testing LXD as an OpenStack Nova plugin.