DevConf CZ 2020

This year, like every year, I find myself wondering which conferences I should go to. DevConf is definitely high on the list: its content revolves mostly around the upstream projects of Red Hat products, the people who attend are open source contributors, the Czech Republic location makes it one of the cheaper conferences to get to, and January is one of the few months when I'm not so crazily busy that I can't attend.

The content

DevConf has several streams of interesting content which can be found here. I dedicated the majority of my limited time to learning more about containers as well as AI/ML. The ones I found most interesting, however, were the more obscure lectures I attended because of my friends' choices. These covered a few non-mainstream subjects that help keep me current in the world of open source and Red Hat.

Friday was the most informative day for me, as I had to travel for most of Sunday and Saturday seemed to morph into a networking session.

Lessons learned

These are the things I want to look up and dive into when I have a few moments of spare time:

From the SRE talks: configuring things like quotas and limits is generally well known, but Pod Disruption Budgets are something that would reward good OCP tenants with a more stable platform. They let you specify safety constraints on pods during operations such as draining a node for maintenance: https://docs.openshift.com/container-platform/4.2/nodes/pods/nodes-pods-configuring.html. Environments with more sophisticated uptime requirements (e.g. nighttime shutdown of most nodes) would benefit from this.
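
As a concrete illustration, here is roughly what creating one of those budgets looks like with the OpenShift Python dynamic client covered later in these notes; the myproject namespace, the app=myapp selector and the minAvailable value are invented for the example.

from kubernetes import client, config
from openshift.dynamic import DynamicClient

# Reuse the credentials from `oc login`, as in the API examples later on
k8s_client = config.new_client_from_config()
dyn_client = DynamicClient(k8s_client)

# PodDisruptionBudget lives in the policy API group (policy/v1beta1 at the time of OCP 4.2)
pdb_resource = dyn_client.resources.get(api_version='policy/v1beta1', kind='PodDisruptionBudget')

pdb = {
    "apiVersion": "policy/v1beta1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "myapp-pdb", "namespace": "myproject"},  # hypothetical names
    "spec": {
        "minAvailable": 1,  # keep at least one replica up while a node is drained
        "selector": {"matchLabels": {"app": "myapp"}},
    },
}

pdb_resource.create(body=pdb, namespace="myproject")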

Tools that check manifests or deployment configs, such as https://github.com/app-sre/manifest-bouncer, exist to educate developers and operations team members as to what a good application deployment looks like. This can benefit communities of operations people who want to improve their SLAs and deployments in order to get the best out of a platform. Ideally, the more robustly a manifest or template has been written, the more reliable the application it deploys.
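
To illustrate the idea, and only as a sketch of the kind of check such a tool performs (this is not manifest-bouncer's actual implementation), a few lines of Python can flag containers in a deployment manifest that lack resource limits or a liveness probe:

import yaml  # PyYAML

def check_deployment(path):
    """Warn about containers without resource limits or a liveness probe."""
    with open(path) as f:
        manifest = yaml.safe_load(f)
    containers = manifest.get("spec", {}).get("template", {}) \
                         .get("spec", {}).get("containers", [])
    warnings = []
    for c in containers:
        if not c.get("resources", {}).get("limits"):
            warnings.append("%s: no resource limits set" % c["name"])
        if "livenessProbe" not in c:
            warnings.append("%s: no liveness probe defined" % c["name"])
    return warnings

# "deployment.yaml" is a hypothetical file name
for warning in check_deployment("deployment.yaml"):
    print(warning)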

Peeking into your compiler was not as big a subject as some of the ones above, but it was nevertheless an interesting session with Ulrich Drepper and Jakub Jelínek. It compares GCC with other compilers (GCC emits assembly as a text file, supports multiple languages with separate front and back ends, and targets different architectures) and then delves into how each individual compiler component can be examined. Even though I don't think I'll be using anything from this talk soon, I was reminded of the complexity and customisability of not only compilers but open source in general. This is both a benefit and a curse: people need to put in the effort to gain the benefit, but usually the benefit outweighs the effort.

CodeReady Containers talks are for developers who want to test their code on an OpenShift-like environment, finally replacing Minishift for v4 with something remotely workable. Bugs such as the 30-day certificate expiry of CRC have been fixed, which enables long-running instances. Integrations with Windows and macOS have also been released to enable developers and organisations to use it effectively.

Performance tuning is an interesting subject from many perspectives, from compilers, as mentioned earlier, to applications. PBench is a tool that caters to a simple use case: ensuring that whatever environment a program runs on, the same suite of tools will gather information about it. Well-known tools such as libvirt, kernel configuration and sos report, and less well-known ones such as block, stockpile and ara, are used to compile a holistic view of what an application needs. It would be interesting to see how well it integrates with OpenShift or other container environments, and indeed with AWS Lambda functions and serverless as those come to saturate the market, but that may be a long way off for this tool.

Artificial Intelligence, and in particular Machine Learning, is a subject with a lot to give. Talks around data classification and pruning were particularly informative in finding the right balance between the right output and a faster system. Compromises were made in achieving the targeted outcome by pruning features that were correlated with each other (e.g. property age and foundation quality are correlated, so only one should be kept) in such a way that the remaining non-correlated data gives the best outcome, a rigorous application of Occam's Razor. AI/ML is a subject that I'm still very new at but interested in learning, as I've always wanted to design the Matrix rather than be part of it.
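
As a rough sketch of that kind of correlation-based pruning, with invented column names and a 0.9 threshold chosen purely for illustration, pandas makes it easy to compute a correlation matrix and drop one column from each highly correlated pair:

import pandas as pd

# Hypothetical housing data: property_age and foundation_quality move together
features = pd.DataFrame({
    "property_age":       [5, 40, 75, 100, 20],
    "foundation_quality": [9, 6, 3, 1, 8],
    "floor_area":         [70, 120, 95, 60, 150],
})

corr = features.corr().abs()
threshold = 0.9
to_drop = set()
cols = list(corr.columns)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if corr.iloc[i, j] > threshold and cols[j] not in to_drop:
            to_drop.add(cols[j])  # keep the first column of each correlated pair

pruned = features.drop(columns=list(to_drop))
print("dropped:", sorted(to_drop))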

How to’s and getting started with…

I generally feel that the 30-minute talks at this conference were not worth my time. Some people would have found them mildly informative, but in actuality 20 minutes of talking and 5 minutes of questions is too little to gain a deeper understanding of a complex topic.

However, I did gather some information on "How to get started with Operators". Operators can be created using three methods: Ansible, Golang, and Helm charts. Operator basics can be investigated through this codeshare: https://github.com/cloudflightio/operator-basics.

A topic that I also found could be greatly expanded is managing keys and key servers for network-bound disk encryption. Rotation of keys can be handled with features from tools such as tang, clevis, and sss.

Words

If anyone wants to play buzzword bingo (it's a real thing) when the next technology adoption phase arrives, these were the words to put on your card.

GitOps was a very prominent word. It was coupled with managing OpenShift or K8s clusters (with ArgoCD as the main example) and having transparent and repeatable processes (heavily featuring operators).

Multi-Cloud was another one. This is the ability to switch clouds at will and avoid vendor lock-in, achieved by using platforms such as K8s to deploy containers and technologies such as Ceph and NooBaa to provide storage on cloud providers with asynchronous replication between them. The talks dived into application use cases for this multi-cloud proposal, such as re-hosting (lift and shift), re-platforming (keeping the existing application and building new capabilities with containers), and refactoring (re-writing, with a lot of effort, from a monolithic to a microservices architecture).

Container Security: this was a topic featured by none other than Dan Walsh and his team. Moving away from root in containers is explained at https://opensource.com/article/18/3/just-say-no-root-containers. Further talks covered dropping capabilities from the container runtime and generating SELinux policies for containers automatically using udica.
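
For flavour, here is a minimal sketch, using the kubernetes Python client that appears later in these notes, of a pod definition that runs as non-root and drops all capabilities; the image and names are placeholders:

from kubernetes import client

# Illustrative only: a pod that refuses to run as root and drops all Linux capabilities
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="hardened-app"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="app",
                image="registry.example.com/myapp:latest",  # placeholder image
                security_context=client.V1SecurityContext(
                    run_as_non_root=True,
                    allow_privilege_escalation=False,
                    capabilities=client.V1Capabilities(drop=["ALL"]),
                ),
            )
        ]
    ),
)

# client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)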

Self and time management

Conferences are always a tiring thing for me. It's a balancing act of finding things that are worthwhile and worth spending mental energy on. Developer conferences in particular have the additional bonus of extended time with people who are far cleverer and more precise than I am. Anyone with impostor syndrome, or on the verge of it, can testify that although this is eye-opening and informative, it does nothing for anyone's ego. Therefore, time to decompress and frequent breaks are important. I find that at least three longer breaks are needed, roughly one every 2.5 hours, to break the day into manageable chunks.

Fortunately, non-techie booths, lectures, and people were plentiful. I made my break time more interesting by looking at funky stickers, talking about neurodiversity, and discussing how the vision of bees and hives could be used as an example for solving complex problems. However, sometimes that doesn't cut it. Sometimes you need a quiet table or a badge that says "don't talk to me". That was available at some conference booths, for which I was extremely grateful. I may not have used it, but it was there should I have needed it.

Things that I’ve missed and would like to see:

I found that, at least on Friday, I wanted to attend at least three lectures that were scheduled at the same time, and on Sunday there were lectures I couldn't attend due to the travel time I had budgeted. There should be recordings and slides of these lectures, but that still doesn't beat being there in person to ask questions and network. These are the ones I would have liked to attend but couldn't:

Will you do it again?

Yes. I think DevConf, despite the fact that it is swimming in Red Hat upstream content, has a lot to offer the ever-learning consultant in me. I would, however, supplement it next year with something that offers a more rounded view of open source. Any suggestions are welcome.

SSL and the OpenShift API

Adventures trying to use the OpenShift and Kubernetes APIs on RHEL, Windows, and from within a pod.

One of the great features of Kubernetes and OpenShift is the API and the ability to perform any action by making the correct REST call, but as a wise man once said, the three great virtues of a programmer are laziness, impatience, and hubris. Making REST calls explicitly and dealing with the results, even in a language like Python, is long-winded and hard work, so obviously libraries have been written that wrap up the OpenShift API to make life easy.

Sadly, writing great documentation is not a virtue that comes naturally to programmers, so the OpenShift Python libraries take a little figuring out, especially when you start hitting weird errors or working on platforms that are not well tested.

Installation is pretty straightforward: a simple

pip install openshift

will install both the openshift Python library and the Kubernetes Python library that it builds upon. Alternatively, on RHEL7 a simple

yum install python2-openshift.noarch

should work. Failing that, the code for the openshift library is part of the official OpenShift release and can be found at https://github.com/openshift/openshift-restclient-python.

Once you have it installed using it is simple:

from kubernetes import client, config
from openshift.dynamic import DynamicClient

k8s_client = config.new_client_from_config()
client = DynamicClient(k8s_client)
v1_pod = client.resources.get(api_version='v1', kind='Pod')
podList = v1_pod.get(namespace="default")
for pod in podList['items']:
    print("%s %s" % (pod.metadata.name, pod.status.podIP))

which should use the credentials stored in .kube/config (generated by the oc login command) to get all the pods in the cluster's default namespace and list their names and IPs.

Sadly, the difference between theory and practice is that in theory they are the same, and in practice they're not. What actually happened when I ran that on my development platform, a slightly left-field mix of Windows 10, Python, and MinGW (don't ask, the customer makes the rules), was that I was greeted with an ugly SSL error message clearly suggesting a certificate problem:

2019-03-19 17:21:30,191 WARNING Retrying 
(Retry(total=2, connect=None, read=None, redirect=None, status=None))
 after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED]
certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)'))': /version

There is an issue on GitHub that seems to fit, https://github.com/openshift/openshift-restclient-python/issues/198, but no solution.

An experiment with the pure Kubernetes library had the same result.

Several frustrating hours later, it appears that there is something strange in how Python on Windows deals with certificates. Working out quite what would have taken a lot longer than I had, certainly longer than it took to figure out the workaround!

That workaround is pretty simple and obvious: you just turn off SSL verification. It's not exactly great practice and should never be done in production, but the technique opens the way to a more flexible approach to authorisation. This leaves our little demo program looking like:

import os
from kubernetes import client, config
from openshift.dynamic import DynamicClient

token = os.getenv("OCP_TOKEN")

config = client.Configuration()
config.host = "https://example.com:8443/"
config.verify_ssl = False
config.api_key = {"authorization": "Bearer " + token}

k8s_client = client.ApiClient(config)
client = DynamicClient(k8s_client)

v1_pod = client.resources.get(api_version='v1', kind='Pod')
podList = v1_pod.get(namespace="default")
for pod in podList['items']:
    print("%s %s" % (pod.metadata.name, pod.status.podIP))

Complete with reading the token from an environment variable which you can set with

export OCP_TOKEN=`oc whoami -t`

And there you are, a simple workaround to an annoying problem.

RHEL

Interestingly, while testing that bit of code I discovered another annoying SSL issue, this time running on RHEL7, where turning off SSL verification leads to the following error:

urllib3.exceptions.SSLError: 
[Errno 2] No such file or directory

This one comes down to the python2-certifi package, which points to a certificate file that it should provide itself but doesn't, and which is needed even with verification turned off. This looks to be fixed in the Python 3 version, but Python 2 is still the default on RHEL7, so that's a little annoying. Luckily, this gives us the chance to make our little test program a lot better, by turning verification back on and pointing it at the system certs provided by the ca-certificates package:

import os
from kubernetes import client, config
from openshift.dynamic import DynamicClient

token = os.getenv("OCP_TOKEN")

k8sconfig = client.Configuration()
k8sconfig.host = "https://example.com:8443/"
k8sconfig.verify_ssl = True
k8sconfig.ssl_ca_cert = "/etc/pki/tls/certs/ca-bundle.crt"
k8sconfig.api_key = {"authorization": "Bearer " + token}

k8s_client = client.ApiClient(k8sconfig)
client = DynamicClient(k8s_client)
v1_pod = client.resources.get(api_version='v1', kind='Pod')
mypod = v1_pod.get(namespace="default")
for pod in mypod['items']:
    print("%s %s" % (pod.metadata.name, pod.status.podIP))

And there we have it, the different variations of the program we need to get it working on different client platforms.

As a final word, the Kubernetes Python library is similar but a little different; the same program would look more like:

import os
from kubernetes import client, config

token = os.getenv("OCP_TOKEN")

k8sconfig = client.Configuration()
k8sconfig.host = "https://example.com:8443"
k8sconfig.verify_ssl = True
k8sconfig.ssl_ca_cert = "/etc/pki/tls/certs/ca-bundle.crt"
k8sconfig.api_key = {"authorization": "Bearer " + token}

k8sclient = client.ApiClient(k8sconfig)
v1 = client.CoreV1Api(k8sclient)
podList = v1.list_namespaced_pod(namespace="default")
for pod in podList.items:
    print("%s"%(pod.metadata.name))

Postscript

Just for fun, if you want to run this from within an OpenShift container there are a couple of changes worth making. There is a service set up automatically which allows pods to contact the API server at https://kubernetes.default.svc, and the certs required to talk to it are injected into all pods at run time. So our little test program becomes:

import os
from kubernetes import client, config
from openshift.dynamic import DynamicClient

token = os.getenv("OCP_TOKEN")

k8sconfig = client.Configuration()
k8sconfig.host = "https://kubernetes.default.svc"
k8sconfig.verify_ssl = True
k8sconfig.ssl_ca_cert = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
k8sconfig.api_key = {"authorization": "Bearer " + token}

k8s_client = client.ApiClient(k8sconfig)
client = DynamicClient(k8s_client)
v1_pod = client.resources.get(api_version='v1', kind='Pod')
mypod = v1_pod.get(namespace="default")
for pod in mypod['items']:
    print("%s %s" % (pod.metadata.name, pod.status.podIP))
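
One further tweak worth mentioning: the pod's service account token is mounted next to that CA certificate, so instead of passing OCP_TOKEN in by hand the program can read the token straight from the filesystem (assuming the service account is allowed to list pods in the target namespace):

from kubernetes import client
from openshift.dynamic import DynamicClient

# The token is injected alongside the CA certificate used above
with open("/var/run/secrets/kubernetes.io/serviceaccount/token") as f:
    token = f.read().strip()

k8sconfig = client.Configuration()
k8sconfig.host = "https://kubernetes.default.svc"
k8sconfig.verify_ssl = True
k8sconfig.ssl_ca_cert = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
k8sconfig.api_key = {"authorization": "Bearer " + token}

dyn_client = DynamicClient(client.ApiClient(k8sconfig))
v1_pod = dyn_client.resources.get(api_version='v1', kind='Pod')
for pod in v1_pod.get(namespace="default")['items']:
    print("%s %s" % (pod.metadata.name, pod.status.podIP))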

OpenShift Container Platform 4 on Azure using Installer Provisioned Infrastructure

Overview

This post is entirely for fun. I am trying a developer preview product – the OpenShift Container Platform 4 (OCP 4) Installer Provisioned Infrastructure (IPI) on Microsoft Azure.

I really didn't want the day to end in blood, sweat, and tears, so I went through as much documentation as I could find related to OCP 4.1 on AWS and Azure, as well as some of the code. I created a pay-as-you-go account and purchased a domain name. For now, let's call it example.com.

Blood, Sweat, and Tears (or not)

The first thing I created in Azure was a resource group called openshift4-azure. I created a public DNS zone with a DNS name in that resource group, delegated for management to the Azure DNS servers. This is to manage the entries that the OCP 4 installer will need to create in order to manage traffic into the cluster.

I then created my local golang environment and path (https://golang.org/doc/install) in order to compile the installer for Azure; the binaries are not readily available yet. To test that this was working: env | grep GOPATH. My path is $HOME/go.

I then forked the openshift/installer repository and cloned it into the Go path: $HOME/go/src/github.com/openshift/. I added the correct upstream (git remote add upstream https://github.com/openshift/installer.git) to my fork in case I made code or documentation changes for PRs. To build the binary I needed, I ran ./hack/build.sh from the installer directory. This created the installer in the bin folder.

I followed the instructions at https://github.com/openshift/installer/tree/master/docs/user/azure/install.md to copy an image for CoreOS into my region. In every region where I want to create a cluster I need to copy the same image. I wanted to run these steps repeatedly, so I downloaded the Azure CLI from https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-yum?view=azure-cli-latest. I'm running this in uksouth, so these are the commands I needed to run:

export VHD_NAME=rhcos-410.8.20190504.0-azure.vhd
az storage account create --location uksouth --name ckocp4storage --kind StorageV2 --resource-group openshift-azure
az storage container create --name vhd --account-name ckocp4storage
az group create --location uksouth --name rhcos_images
ACCOUNT_KEY=$(az storage account keys list --account-name ckocp4storage --resource-group openshift-azure --query "[0].value" -o tsv)
az storage blob copy start --account-name "ckocp4storage" --account-key "$ACCOUNT_KEY" --destination-blob "$VHD_NAME" --destination-container vhd --source-uri "https://openshifttechpreview.blob.core.windows.net/rhcos/$VHD_NAME"

It took me a few tries to create a unique storage account; the name needs to be globally unique.

It is recommended to use the Premium_LRS SKU. To get premium storage in Azure on a PAYG account, I needed to register the storage resource provider in the subscription (PayAsYouGo subscription -> Resource Providers). Before creating the image, the storage blob needs to finish copying, otherwise you get the following error:

Cannot import source blob https://ckocp4storage.blob.core.windows.net/vhd/rhcos-410.8.20190504.0-azure.vhd since it has not been completely copied yet. Copy status of the blob is CopyPending.

Once the copy had finished, I created the image from the blob:

export RHCOS_VHD=$(az storage blob url --account-name ckocp4storage -c vhd --name "$VHD_NAME" -o tsv)
az image create --resource-group rhcos_images --name rhcostestimage --os-type Linux --storage-sku Premium_LRS --source "$RHCOS_VHD" --location uksouth

I created a service principal for my installation and copied the following somewhere safe:

az ad sp create-for-rbac --name openshift4azure
{
"appId": "serviceprincipal",
"displayName": "openshift4azure",
"name": "http://openshift4azure",
"password": "serviceprincipalpassword",
"tenant": "tenant id"
}

And gave it the following access:

az role assignment create --assignee serviceprincipal --role "User Access Administrator"
az role assignment create --assignee serviceprincipal --role "Contributor"

I then got my oc client and pull secret as described at https://cloud.redhat.com/openshift/install/azure/user-provisioned.

I tried my first Azure IPI OCP4 install and the first thing that I got was the following.

openshift-install create cluster
? SSH Public Key $HOME/.ssh/id_rsa.pub
? azure subscription id yyyy-xxxx-nnnn-bbbb-fffffff
? azure tenant id yyy-xxxx-nnnn-bbbb-nnnnnn
? azure service principal client id yyy-xxxx-nnnn-bbbb-ccccccccc
? azure service principal client secret [? for help] ************************************
INFO Saving user credentials to "$HOME/.azure/osServicePrincipal.json" 
? Region uksouth
? Base Domain example.com
? Cluster Name attempt1
? Pull Secret [? for help] ****************************************
INFO Creating infrastructure resources... 
^CERROR 
ERROR Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation results in exceeding quota limits of Core. Maximum allowed: 10, Current in use: 8, Additional requested: 8. Please read more about quota increase at https://aka.ms/ProdportalCRP/?#create/Microsoft.Support/Parameters/{\"subId\":\"ae90eef6-f8ea-479c-8c6a-9dd4bf9e47d0\",\"pesId\":\"15621\",\"supportTopicId\":\"32447243\"}." 
ERROR 
ERROR on ../../../../../../../../tmp/openshift-install-216822811/bootstrap/main.tf line 117, in resource "azurerm_virtual_machine" "bootstrap": 
ERROR 117: resource "azurerm_virtual_machine" "bootstrap" { 
ERROR 
ERROR 
ERROR 
ERROR Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation results in exceeding quota limits of Core. Maximum allowed: 10, Current in use: 8, Additional requested: 8. Please read more about quota increase at https://aka.ms/ProdportalCRP/?#create/Microsoft.Support/Parameters/{\"subId\":\"ae90eef6-f8ea-479c-8c6a-9dd4bf9e47d0\",\"pesId\":\"15621\",\"supportTopicId\":\"32447243\"}." 
ERROR 
ERROR on ../../../../../../../../tmp/openshift-install-216822811/master/master.tf line 44, in resource "azurerm_virtual_machine" "master": 
ERROR 44: resource "azurerm_virtual_machine" "master" { 
ERROR 
ERROR 
ERROR 
ERROR Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation results in exceeding quota limits of Core. Maximum allowed: 10, Current in use: 8, Additional requested: 8. Please read more about quota increase at https://aka.ms/ProdportalCRP/?#create/Microsoft.Support/Parameters/{\"subId\":\"ae90eef6-f8ea-479c-8c6a-9dd4bf9e47d0\",\"pesId\":\"15621\",\"supportTopicId\":\"32447243\"}." 
ERROR 
ERROR on ../../../../../../../../tmp/openshift-install-216822811/master/master.tf line 44, in resource "azurerm_virtual_machine" "master": 
ERROR 44: resource "azurerm_virtual_machine" "master" { 
ERROR 
ERROR 
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform

A standard PAYG account does not allow for the number of resources that IPI creates. It requires more than the 10 cores allowed by default, so I needed to request a compute quota increase:

Resource Manager, UKSOUTH, DSv2 Series from 10 to 100
Resource Manager, UKSOUTH, DSv3 Series from 10 to 100

I needed to export the environment variable for the RHCOS install image, which I found from the storage account blob:

export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE="/resourceGroups/rhcos_images/providers/Microsoft.Compute/images/rhcostestimage"

I destroyed the stack (openshift-install destroy cluster --dir=cluster-dir) to try again and watched with glee as my Azure attempt-1 resource group diminished. It was then time for attempt 2, for which I also passed the location of the Azure authentication credentials in a JSON file by exporting AZURE_AUTH_LOCATION=creds.json. Baaaad idea. The installer overwrote the file at that location. It's a good thing I had a copy and didn't particularly care.

Attempt 2 seems to have worked and I have an operational cluster. All my operators are in a good state (not degraded and not progressing):

$ ~/bin/oc get co
NAME                                       VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.okd-2019-06-25-110619   True        False         False      46m
cloud-credential                           4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
cluster-autoscaler                         4.2.0-0.okd-2019-06-25-110619   True        False         False      65m
console                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      50m
dns                                        4.2.0-0.okd-2019-06-25-110619   True        False         False      65m
image-registry                             4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
ingress                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      53m
kube-apiserver                             4.2.0-0.okd-2019-06-25-110619   True        False         False      61m
kube-controller-manager                    4.2.0-0.okd-2019-06-25-110619   True        False         False      62m
kube-scheduler                             4.2.0-0.okd-2019-06-25-110619   True        False         False      61m
machine-api                                4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
machine-config                             4.2.0-0.okd-2019-06-25-110619   True        False         False      62m
marketplace                                4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
monitoring                                 4.2.0-0.okd-2019-06-25-110619   True        False         False      52m
network                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
node-tuning                                4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
openshift-apiserver                        4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
openshift-controller-manager               4.2.0-0.okd-2019-06-25-110619   True        False         False      62m
openshift-samples                          4.2.0-0.okd-2019-06-25-110619   True        False         False      53m
operator-lifecycle-manager                 4.2.0-0.okd-2019-06-25-110619   True        False         False      63m
operator-lifecycle-manager-catalog         4.2.0-0.okd-2019-06-25-110619   True        False         False      63m
operator-lifecycle-manager-packageserver   4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
service-ca                                 4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
service-catalog-apiserver                  4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
service-catalog-controller-manager         4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
storage                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
support                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      66m


Conclusion

For a first attempt at a developer preview, things went very well. I've trawled through the Azure logs and found things like access role issues, so I still don't know if I made a mistake in my service principal allocation. I think some better error handling and messages would help the installer. I'd hate to see things like MachineSets failing to scale because my IAM is wrong without knowing about it. Of course, general things like installation behind a proxy, bring-your-own DNS or security groups/networking, and better publicising of the CoreOS images would also help.

I’m hoping to find out more as I use the cluster over the next few days. If you haven’t yet, try the installer on Azure and let me know what you think:

  1. To get started, visit try.openshift.com and click on “Get Started”.
  2. Log in or create a Red Hat account and follow the instructions for setting up your first cluster on Azure.


OpenShift – From Design and Deploy to Deliver and Transform: Optimising Distributed Teams with Agile Practices

Overview

Frequently when I'm on site, I'm expected, even if not directly asked, to advise my customers on how to get the best use out of a technology. In this post I'm examining a recent scenario: putting structure around an OpenShift deployment in order to create a collaboration environment that would aid the use of the technology. We were also deploying OpenShift itself, but OpenShift deployment is a well-covered subject across the board.

Background

I recently visited a customer that wished to containerise the world and provide its developer community with both Containers as a Service (CaaS), a single enterprise Kubernetes cluster that would allow groups of developers to develop and deploy, and an enterprise Kubernetes cluster as a Service (KaaS) offering, a series of clusters that would be ordered on demand by different management chains and in different security groups. Although I think the first is easy to do and would fit many use cases, the second is definitely more complex; big vendors and service companies still struggle to update and maintain multiple clusters of Kubernetes distributions, especially when those distributions have massively different configurations.

When I first went on site, I realised that I was in London while my primary contacts were working remotely. This is quite uncommon for consulting engagements, but it was a common theme for the organisation I was working with: distributed teams with minimal travel budgets. I need to pick my battles as to what I can change, so I set course to meet my primary contacts in a central European city that suited them, in order to organise a series of workshops that would help us agree on ways of working, tools, technologies, architecture and so on. Even though I had already been working on this project remotely for a few weeks, this was a major breakthrough for the pace of work and a highly effective way of getting to know and trust each other. Other than time and experience in the field, a few of the techniques I used played a major role in that too.

Using Open Practice Library practices in a distributed team

At the time, I had recently finished a precursor of the current DevOps Culture and Practice Enablement (DO500) course and I was eager to put what I had learned into practice. I thought these methods were always effective, able to bring people together to talk about the right things.

When I arrived in the mutually agreed location, I was given a list of objectives to help the organisation deploy OpenShift Container Platform (OCP) as a service. We started by discussing why we were trying to achieve what we were trying to achieve and what success would look like, using a Start At The End method. This was very useful in giving us context and direction, as we wanted to make sure the business would get the most out of this. It made us focus on the end goal: user (developer) satisfaction through seamless integration with current customer systems, ease of testability, and engagement with the product.

We then followed on by agreeing a list of things we would continue doing to make sure that collaboration and focus didn't wane; we built our foundation:

  • We decided to use pair programming techniques: two people delivering a feature, and more when learning something new in the platform. Using this when delivering features to the platform ensured that knowledge was distributed across the team. It also kept a constant channel of communication open between distributed team members. Old-fashioned video conferencing and screen sharing was sufficient at the time, but we later explored tmux configuration for shared command-line access to machines. Anything beyond that was a struggle on the tooling front, as the environment was locked down too tightly to allow the Live Share functionality of VS Code or something similar.
  • It was important for us to ensure that everything we did was repeatable, so every change we wanted to make, whether a configuration change, a build change, or deploying new servers, was codified first. We mainly used Ansible playbooks or Jenkins pipelines and followed the everything-as-code practice. We used git, which made our code versionable, and when we released a new stable version of the platform we tagged it to mark that point, so we could always revert to a working version. This helped us a few times, especially at the beginning, when we needed to spin up a new cluster very quickly to test new functionality.
  • We agreed on a set of rules we'd all abide by, including core hours of work, remote scrum times, and, potentially, a sense of humour. We wrote our social contract, signed it, and then transcribed it to our wiki. This gave us an understanding of when and how it was best to collaborate, even with our different cultural backgrounds and timezones.

I've seen a few of these deployments in the past, and one of the main success or failure criteria I have seen is development and business engagement. Therefore, it was important to ensure that developers were engaged as much as possible to use this platform to test and develop.

Tweaks

The initial set of practices we used to collaborate worked quite well but needed a few modifications and additions. Below are things that we changed or would have liked to have changed:

  • OpenShift, and Kubernetes in general, is a fast-moving platform; while learning about all the new components, integrations, and modifications, it was important to educate our users too. We set up time during our days to absorb new material from the community by reading blog posts, following tutorials, and adapting some of it for our users' consumption. This is something we then added to our social contract.
  • Empathy mapping and user interviews for increasing user engagement were something we were all interested in, and they were a key factor in getting the platform moving forward. We wanted to ensure that new users of any container technology would first try, and hopefully succeed, with OpenShift. We came up with a list of teams that were aiming to create cloud-native workloads or could benefit from modernisation, and a list of questions to understand their current frustrations and constraints when developing. This gave us a direct line to our main users once we started enabling features on the platform.
  • Principles such as everything-as-code are great for the 80% of work that is well understood. However, there was a good 20% where the value of automating something first had to be proven by testing the change manually. That gap was later minimised by introducing automated tests as part of building a new cluster, which told us whether our changes were sane and worked.
  • Not all scrum events worked well in this distributed team. Our daily standup ended up as a debugging session more often than not. Although this was useful, I feel we were missing the point and not focusing our time well, and I understand why: the setting was exactly the same, as we were on a video call to each other all day. It improved a little after one suggestion, to actually stand up during the call, but I feel it would have been much easier with a scrum master to enable us.
  • Visualising our workload was something we had to do using digital tools like wikis and digital kanban boards. However, having a physical copy of our social contract and actual boards to write on would have helped massively in refocusing every time we looked around or went for a coffee. Space didn't allow us to do that, but I believe it would have brought even better results.

Next Time

These are the things I would do differently next time.

I would love to move that initial collaboration meeting a few weeks earlier. It was the catalyst for working better together; it created connections, and a lot of trust, that are very difficult to forge over the phone or video conferencing.

Product owner involvement was not as high as I expected and was delegated to the team. Although this gave us more power, creating the initial connections to the developers was slow and frustrating. If I were to do this again, I would stress even more how important the product owner's time with the team and the developers would be.

Takeaways

So far, with the practices used above, I've seen not only a successful deployment and use of OpenShift but also a clear shift in how people talk to each other and want to contribute to this project. Whether it's a small company or a global supertanker of an organisation, everyone wants to improve their ways of working, and that was keenly felt here. These practices are easy to try, but they take discipline and good humour to keep up, especially in the context of widely distributed teams. I would thoroughly recommend trying them and reporting back on which ones you picked and why.

References

If you want to learn more about practices used in this blog post visit https://openpracticelibrary.com.

If you are interested in OpenShift and learning more about the technology visit https://docs.openshift.com.

If you are interested in automation around self-created IaaS and OpenShift, follow the CASL project. This was used as an upstream OpenShift deployment tool with pre- and post-installation hooks, and upstream changes were made to ensure the community would be able to work around the customer's required changes.