DevConf CZ 2020

This year, like every year, I find myself wondering which conferences I should go to. DevConf is definitely high on the list: its content is built mostly around the upstream projects of Red Hat products, the people who go to it are open source contributors, the Czech Republic location makes it one of the cheaper conferences around, and January is not so crazily busy for me that I can't attend.

The content

DevConf has several streams of interesting content which can be found here. I dedicated the majority of my limited time to learning more about containers as well as AI/ML. The talks I found most interesting, however, were the more obscure ones I attended because of my friends' choices: lectures on non-mainstream subjects that keep me current in the world of open source and Red Hat.

Friday was the most informative day for me, as I had to travel for most of Sunday, and Saturday seemed to have morphed into a networking session.

Lessons learned

These are the things I want to look up and dive into when I have a few moments of spare time:

From the SRE talks: configuring things like quotas and limits is generally well known, but Pod Disruption Budgets are something that would reward good OCP tenants with a more stable platform. They allow the specification of safety constraints on pods during operations such as draining a node for maintenance: https://docs.openshift.com/container-platform/4.2/nodes/pods/nodes-pods-configuring.html. Environments with more sophisticated uptime requirements would benefit from this (e.g. nighttime shutdown of most nodes).
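As a minimal sketch (the app label and replica count are placeholders of mine, not from the talk), a PodDisruptionBudget that keeps at least one replica alive during voluntary disruptions such as a node drain could look like this:

cat <<'EOF' | oc apply -f -
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 1            # never voluntarily evict below one running pod
  selector:
    matchLabels:
      app: myapp             # hypothetical label; match your deployment's pod labels
EOF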

Tools exist that check manifests or deployment configs in order to educate developers and operations team members about what a good application deployment looks like, for example https://github.com/app-sre/manifest-bouncer. This can benefit communities of operations people that want to improve their SLAs and deployments in order to get the best out of a platform. Ideally, the more robustly a manifest or template is written, the more reliable the application it deploys.
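Purely for illustration (these are common checks rather than necessarily what manifest-bouncer enforces, and the image name is made up), a robust Deployment usually declares resource requests/limits and health probes:

cat <<'EOF' | oc apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: quay.io/example/myapp:1.0          # hypothetical image
        resources:
          requests: { cpu: 100m, memory: 128Mi }  # scheduling guarantees
          limits:   { cpu: 500m, memory: 256Mi }  # protect the node from runaways
        readinessProbe:
          httpGet: { path: /healthz, port: 8080 } # assumed health endpoint
        livenessProbe:
          httpGet: { path: /healthz, port: 8080 }
EOF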

Peeking into your compiler was not quite as big a subject as some of the above, but it was nevertheless an interesting session with Ulrich Drepper and Jakub Jelínek. It compared GCC with other compilers (GCC emits assembly as a text file), covered how it supports multiple languages with separate front ends and back ends, and how it targets different architectures. It then delved into the specifics of that and how each individual compiler component can be examined. Even though I don't think I'll be using anything from this talk soon, I was reminded of the complexity and customisability of not only compilers but open source in general. This is both a benefit and a curse: yes, people need to put in effort to gain a benefit, but usually the benefits outweigh the effort.

CodeReady Containers talks are for developers that want to test their code on an OpenShift-like environment, finally replacing Minishift for v4 with something remotely workable. Bugs such as the 30-day certificate expiry of CRC have been fixed, which enables long-running instances of it. Integrations with Windows and macOS have also been released to enable developers and organisations to use it effectively.
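Getting started is roughly the following (a sketch; exact flags differ between CRC releases, and you need a pull secret from cloud.redhat.com):

crc setup                     # prepares the host (virtualisation, networking)
crc start                     # asks for the pull secret and boots a single-node cluster
eval $(crc oc-env)            # puts the bundled oc client on the PATH
crc console --credentials     # prints the developer and kubeadmin login details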

Performance tuning is an interesting subject from many perspectives: from compilers (as mentioned earlier) to applications. PBench is a tool that caters to a simple use case: whatever environment a program runs on, the same suite of tools will gather information about it. Well-known tools such as libvirt, kernel configuration and sos report, and less well-known ones such as block, stockpile and ara, are used to compile a holistic view of what an application needs. It would be interesting to see how well it integrates with OpenShift or other container environments, and indeed with AWS Lambda functions and serverless as that comes to saturate the market, but that may be a long way away for this tool.
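A typical pbench-agent run looks something like this (command names as per the pbench-agent documentation; they may differ between versions, and the workload script is a placeholder of mine):

pbench-register-tool-set                     # register the default monitoring tools (sar, iostat, ...)
pbench-user-benchmark -- ./my_workload.sh    # run the (hypothetical) workload while the tools collect data
pbench-move-results                          # ship the results to the configured pbench server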

Artificial Intelligence, and in particular Machine Learning, is a subject with a lot to give. Talks around data classification and pruning were particularly informative on finding the right balance between the right output and a faster system. Compromises were made to achieve the targeted outcome by pruning features that were correlated with each other (e.g. property age and foundation quality are correlated, so only one should be kept) in such a way that the remaining non-correlated data gives the best outcome, applying a fairly rigorous version of Occam's Razor. AI/ML is a subject that I'm still very new at but interested in learning, as I've always wanted to design the Matrix and not be part of it.

How to’s and getting started with…

I generally feel that the 30 minute talks during this conference were not worth my time. Some people may have found them mildly informative, but in actuality 20 minutes of talking and 5 minutes of questions was too little to gain a deeper understanding of a complex topic.

However, I have gathered some information on “How to get started with Operators”. Operators can be created using three methods: Ansible, Golang, and Helm charts. Operator basics can be investigated through this codeshare: https://github.com/cloudflightio/operator-basics.
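As a rough sketch of the Ansible route (operator-sdk v1.x syntax; older releases used operator-sdk new, and the domain, group, kind and image below are made-up examples):

operator-sdk init --plugins=ansible --domain example.com
operator-sdk create api --group cache --version v1alpha1 --kind Memcached --generate-role
make docker-build docker-push IMG=quay.io/example/memcached-operator:v0.0.1   # targets provided by the generated Makefile
make deploy IMG=quay.io/example/memcached-operator:v0.0.1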

A topic that I also found could be greatly expanded is managing keys and key servers for network-bound disk encryption (NBDE). Key rotation can be done using features from tools such as tang, clevis, and sss.
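The basic client/server flow goes along these lines (a sketch assuming a tang server at tang.example.com and a LUKS volume on /dev/sda2):

# on the key server: tang is socket-activated, listening on port 80 by default
systemctl enable --now tangd.socket

# on the client: bind a LUKS device to the tang server so it can unlock automatically at boot
clevis luks bind -d /dev/sda2 tang '{"url":"http://tang.example.com"}'

# rotation: generate fresh keys on the server (e.g. /usr/libexec/tangd-keygen /var/db/tang),
# retire the old ones, then rebind the clients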

Words

If anyone wants to play buzzword bingo (it's a real thing) when the next technology adoption phase arrives, these were the words to put on the card:

GitOps was a very prominent word. It was coupled with managing OpenShift or K8s clusters (with ArgoCD as the main example) and having transparent and repeatable processes (heavily featuring operators).

Multi-cloud was another one. This is the ability to switch clouds at will and avoid vendor lock-in, by using platforms such as K8s to deploy containers, and technologies such as Ceph and NooBaa to provide storage on cloud providers with async replication between them. The talks dove into application use cases for this multi-cloud proposal, such as re-hosting (lift and shift), re-platforming (keep the existing application and build new capabilities with containers), and refactoring (re-writing, with a lot of effort, from monolithic to microservices architectures).

Container security: this was a topic featured by none other than Dan Walsh and his team. Moving away from root in containers is explained at https://opensource.com/article/18/3/just-say-no-root-containers. Other talks covered dropping capabilities from the container runtime and generating SELinux policies for containers automatically using udica.
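For example (a sketch; the image name is a placeholder, and the semodule line follows udica's own suggested output, so paths may vary by distribution):

# run with a minimal capability set instead of the runtime defaults
podman run -d --name myapp --cap-drop=ALL --cap-add=NET_BIND_SERVICE quay.io/example/myapp:1.0

# generate and load a tailored SELinux policy for that container with udica
podman inspect myapp | udica myapp_policy
semodule -i myapp_policy.cil /usr/share/udica/templates/base_container.cil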

Self and Time Management

Conferences are always tiring for me. It's a balancing act of finding things that are worthwhile and worth spending mental energy on. Developer conferences in particular have the additional bonus of extended time with people that are far cleverer and more accurate than I am. Anyone with impostor syndrome, or on the verge of it, can testify that although this is eye-opening and informative, it's not boosting anyone's ego. Therefore, time to decompress and frequent breaks are important. I find that at least 3 longer breaks are needed, roughly one every 2.5 hours, to break the day into manageable chunks.

Fortunately, non-techie booths, lectures, and people were plentiful. I made my break time more interesting by looking at funky stickers, talking about neurodiversity, and talking about how the vision of bees or hives could be used as an example to solve complex problems. However, sometimes that doesn't cut it. Sometimes you need a quiet table or a badge that says “don't talk to me”. That was available at some conference booths, which I felt extremely grateful for. I may not have used it, but it was there should I have needed it.

Things that I missed and would like to see

I found that on Friday alone I wanted to attend at least 3 lectures that were scheduled at the same time, and on Sunday there were lectures I couldn't attend due to the travel time I had budgeted. There should be recordings and slides of these lectures, but that still doesn't beat being there in person to ask questions and network. There were several I would have liked to attend but couldn't.

Will you do it again?

Yes. I think DevConf, despite swimming in Red Hat upstream content, has a lot of benefits to provide to the ever-learning consultant in me. I would, however, supplement it next year with something that has a more rounded view of open source. Any suggestions are welcome.

OpenShift Container Platform 4 on Azure using Installer Provisioned Infrastructure

Overview

This post is entirely for fun. I am trying a developer preview product – the OpenShift Container Platform 4 (OCP 4) Installer Provisioned Infrastructure (IPI) on Microsoft Azure.

I really didn't want the day to end in blood, sweat, and tears, so I went through as much OCP 4.1 documentation as I could about AWS, Azure, and generally some code. I created a pay-as-you-go account and purchased a domain name. For now let's call it example.com.

Blood, Sweat, and Tears (or not)

The first thing I created in Azure was a resource group called openshift4-azure. In that resource group I created a public DNS zone, with a DNS name that was delegated for management to the Azure DNS servers. This is to manage the entries that the OCP 4 installer will need to create in order to manage traffic into the cluster.

I then created my local golang environment and path (https://golang.org/doc/install). This was to compile the installer for Azure, as the binaries are not readily available yet. To test that this was working: env | grep GOPATH. My path is $HOME/go.
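A minimal workspace setup looks like this (assuming the default $HOME/go path mentioned above):

export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
mkdir -p $GOPATH/src $GOPATH/bin
env | grep GOPATH              # should print GOPATH=/home/<user>/go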

I then forked the openshift/installer repository and cloned it into the go path: $HOME/go/src/github.com/openshift/. I added the correct upstream (git remote add upstream https://github.com/openshift/installer.git) to my fork in case I made code/documentation changes for PRs. To build the binary I needed, I ran ./hack/build.sh from the installer directory. This created the installer in the bin folder.
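Put together, the sequence was roughly the following (the fork URL is a placeholder for your own):

mkdir -p $GOPATH/src/github.com/openshift
cd $GOPATH/src/github.com/openshift
git clone https://github.com/<your-fork>/installer.git
cd installer
git remote add upstream https://github.com/openshift/installer.git
./hack/build.sh                # the openshift-install binary ends up in ./bin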

I followed the instructions at https://github.com/openshift/installer/tree/master/docs/user/azure/install.md to clone a CoreOS image into my region. In every region where I want to create a cluster I need to copy the same image. I wanted to run these steps repeatedly, so I downloaded the Azure CLI from https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-yum?view=azure-cli-latest. I'm running this in uksouth, so these are the commands I needed to run:

export VHD_NAME=rhcos-410.8.20190504.0-azure.vhd
az storage account create --location uksouth --name ckocp4storage --kind StorageV2 --resource-group openshift-azure
az storage container create --name vhd --account-name ckocp4storage
az group create --location uksouth --name rhcos_images
ACCOUNT_KEY=$(az storage account keys list --account-name ckocp4storage --resource-group openshift-azure --query "[0].value" -o tsv)
az storage blob copy start --account-name "ckocp4storage" --account-key "$ACCOUNT_KEY" --destination-blob "$VHD_NAME" --destination-container vhd --source-uri "https://openshifttechpreview.blob.core.windows.net/rhcos/$VHD_NAME"

It took me a few tries to create a unique storage account name; storage account names need to be globally unique, as they form part of the blob endpoint URL.

It is recommended to use the Premium_LRS SKU. To get premium storage in Azure on a PAYG account, I needed to register the right resource provider for the subscription (PayAsYouGo subscription -> Resource Providers). Before creating the image, the storage blob also needs to finish copying, otherwise you get the following error:

Cannot import source blob https://ckocp4storage.blob.core.windows.net/vhd/rhcos-410.8.20190504.0-azure.vhd since it has not been completely copied yet. Copy status of the blob is CopyPending.
export RHCOS_VHD=$(az storage blob url --account-name ckocp4storage -c vhd --name "$VHD_NAME" -o tsv)
az image create --resource-group rhcos_images --name rhcostestimage --os-type Linux --storage-sku Premium_LRS --source "$RHCOS_VHD" --location uksouth

I created a service principal for my installation and copied the following somewhere safe:

az ad sp create-for-rbac --name openshift4azure
{
"appId": "serviceprincipal",
"displayName": "openshift4azure",
"name": "http://openshift4azure",
"password": serviceprincipalpassword",
"tenant": "tenant id"
}

And gave it the following access:

az role assignment create --assignee serviceprincipal --role "User Access Administrator"
az role assignment create --assignee serviceprincipal --role "Contributor"

I then got my oc client and pull secret as described at https://cloud.redhat.com/openshift/install/azure/user-provisioned.

I tried my first Azure IPI OCP4 install and the first thing that I got was the following.

openshift-install create cluster
? SSH Public Key $HOME/.ssh/id_rsa.pub
? azure subscription id yyyy-xxxx-nnnn-bbbb-fffffff
? azure tenant id yyy-xxxx-nnnn-bbbb-nnnnnn
? azure service principal client id yyy-xxxx-nnnn-bbbb-ccccccccc
? azure service principal client secret [? for help] ************************************
INFO Saving user credentials to "$HOME/.azure/osServicePrincipal.json" 
? Region uksouth
? Base Domain example.com
? Cluster Name attempt1
? Pull Secret [? for help] ************************
INFO Creating infrastructure resources... 
^CERROR 
ERROR Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation results in exceeding quota limits of Core. Maximum allowed: 10, Current in use: 8, Additional requested: 8. Please read more about quota increase at https://aka.ms/ProdportalCRP/?#create/Microsoft.Support/Parameters/{\"subId\":\"ae90eef6-f8ea-479c-8c6a-9dd4bf9e47d0\",\"pesId\":\"15621\",\"supportTopicId\":\"32447243\"}." 
ERROR 
ERROR on ../../../../../../../../tmp/openshift-install-216822811/bootstrap/main.tf line 117, in resource "azurerm_virtual_machine" "bootstrap": 
ERROR 117: resource "azurerm_virtual_machine" "bootstrap" { 
ERROR 
ERROR 
ERROR 
ERROR Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation results in exceeding quota limits of Core. Maximum allowed: 10, Current in use: 8, Additional requested: 8. Please read more about quota increase at https://aka.ms/ProdportalCRP/?#create/Microsoft.Support/Parameters/{\"subId\":\"ae90eef6-f8ea-479c-8c6a-9dd4bf9e47d0\",\"pesId\":\"15621\",\"supportTopicId\":\"32447243\"}." 
ERROR 
ERROR on ../../../../../../../../tmp/openshift-install-216822811/master/master.tf line 44, in resource "azurerm_virtual_machine" "master": 
ERROR 44: resource "azurerm_virtual_machine" "master" { 
ERROR 
ERROR 
ERROR 
ERROR Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation results in exceeding quota limits of Core. Maximum allowed: 10, Current in use: 8, Additional requested: 8. Please read more about quota increase at https://aka.ms/ProdportalCRP/?#create/Microsoft.Support/Parameters/{\"subId\":\"ae90eef6-f8ea-479c-8c6a-9dd4bf9e47d0\",\"pesId\":\"15621\",\"supportTopicId\":\"32447243\"}." 
ERROR 
ERROR on ../../../../../../../../tmp/openshift-install-216822811/master/master.tf line 44, in resource "azurerm_virtual_machine" "master": 
ERROR 44: resource "azurerm_virtual_machine" "master" { 
ERROR 
ERROR 
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform

A standard PAYG account does not allow for the number of resources that IPI creates. It requires more than the 10 cores allowed by default, so I needed to request a compute quota increase:

Resource Manager, UKSOUTH, DSv2 Series from 10 to 100
Resource Manager, UKSOUTH, DSv3 Series from 10 to 100
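You can check the current usage and limits from the CLI before and after the support request goes through, for example:

az vm list-usage --location uksouth -o table | grep -i -E 'DSv2|DSv3'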

I needed to export the environment variable for the RHCOS install image, which I found from the storage account blob:

export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE="/resourceGroups/rhcos_images/providers/Microsoft.Compute/images/rhcostestimage"

I destroyed the stack (openshift-install destroy cluster --dir=cluster-dir) to try again and watched with glee as my Azure attempt-1 resource group diminished. It was then time for attempt 2, for which I also passed the Azure authentication credentials location in a json file by exporting the variable AZURE_AUTH_LOCATION=creds.json. Baaaad idea. The installer overwrote my credentials file. It's a good thing I had a copy and didn't particularly care.

Attempt 2 seems to have worked. I have an operational cluster. All my operators are running in a good state (not degraded and not progressing):

bin]$ ~/bin/oc get co
NAME                                       VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.okd-2019-06-25-110619   True        False         False      46m
cloud-credential                           4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
cluster-autoscaler                         4.2.0-0.okd-2019-06-25-110619   True        False         False      65m
console                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      50m
dns                                        4.2.0-0.okd-2019-06-25-110619   True        False         False      65m
image-registry                             4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
ingress                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      53m
kube-apiserver                             4.2.0-0.okd-2019-06-25-110619   True        False         False      61m
kube-controller-manager                    4.2.0-0.okd-2019-06-25-110619   True        False         False      62m
kube-scheduler                             4.2.0-0.okd-2019-06-25-110619   True        False         False      61m
machine-api                                4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
machine-config                             4.2.0-0.okd-2019-06-25-110619   True        False         False      62m
marketplace                                4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
monitoring                                 4.2.0-0.okd-2019-06-25-110619   True        False         False      52m
network                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
node-tuning                                4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
openshift-apiserver                        4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
openshift-controller-manager               4.2.0-0.okd-2019-06-25-110619   True        False         False      62m
openshift-samples                          4.2.0-0.okd-2019-06-25-110619   True        False         False      53m
operator-lifecycle-manager                 4.2.0-0.okd-2019-06-25-110619   True        False         False      63m
operator-lifecycle-manager-catalog         4.2.0-0.okd-2019-06-25-110619   True        False         False      63m
operator-lifecycle-manager-packageserver   4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
service-ca                                 4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
service-catalog-apiserver                  4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
service-catalog-controller-manager         4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
storage                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
support                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      66m

 

Conclusion

For a first attempt on a developer preview, things went very well. I've trawled through the Azure logs and found things like access role issues, so I still don't know if I've made a mistake in my Service Principal allocation. I think some better error handling and messages would help with the installer. I'd hate to see things like Machine Sets not being able to expand because my IAM is wrong and I didn't know about it. Of course, general things like installation behind a proxy, bring-your-own DNS or security groups/networking, and better publicising of the CoreOS images would also help.

I’m hoping to find out more as I use the cluster over the next few days. If you haven’t yet, try the installer on Azure and let me know what you think:

  1. To get started, visit try.openshift.com and click on “Get Started”.
  2. Log in or create a Red Hat account and follow the instructions for setting up your first cluster on Azure.


OpenShift – From Design and Deploy to Deliver and Transform: Optimising Distributed Teams with Agile Practices

Overview

Frequently when I'm on site, I am expected (even if not directly asked) to advise my customers on how to get the best use out of a technology. In this post I'm examining a recent scenario: providing structure around an OpenShift deployment in order to create a collaboration environment that would aid the use of the technology. We were also deploying OpenShift itself, but OpenShift deployment is a well-covered subject across the board.

Background

I recently visited a customer that wished to containerise the world and provide their developer community with a Container as a Service (CaaS) offering, a single enterprise Kubernetes cluster that would allow groups of developers to develop and deploy, as well as an Enterprise Kubernetes cluster as a Service (KaaS) offering, a series of clusters that would be ordered on demand by different management chains and placed in different security groups. Although I think the first is easy to do and would fit many use cases, the second is definitely more complex; big vendors and service companies still struggle to update and maintain multiple clusters of Kubernetes distributions, especially when those distributions have massively different configurations.

When I first went on site, I realised that I was in London while my primary contacts were working remotely. This is quite uncommon for consulting engagements, but it was a common theme for the organisation I was working with: distributed teams with minimal travel budgets. I need to pick my fights as to what I can change, so I set course to meet my primary contacts in a central European city that suited them, to organise a series of workshops that would help us agree on ways of working, tools, technologies, architecture, etc. Even though I had been working on this project remotely for a few weeks, this was a major breakthrough in the pace of work. It was a highly effective way of getting to know and trusting each other. Other than time and experience in the field, a few techniques that I used played a major role in that too.

Using Open Practice Library practices in a distributed team

At the time, I had recently finished a precursor of the current DevOps Culture and Practice Enablement (DO500) course and I was eager to put what I had learned into practice. I thought that these methods are always effective and able to bring people together to talk about the right things.

When I arrived in the mutually agreed location, I was given a list of objectives to help the organisation deploy OpenShift Container Platform (OCP) as a service. We started by discussing why we were trying to achieve what we were trying to achieve, and what success would look like, using a Start At The End method. This was very useful in giving us context and direction, as we wanted to make sure that the business would get the most out of this. It made us focus on the end goal: user (developer) satisfaction through seamless integration with current customer systems, ease of testability, and engagement with the product.

We then followed by agreeing on a list of things we would continue doing to make sure that collaboration and focus didn't wane; we built our foundation:

  • We decided to use pair programming techniques, with two people delivering a feature together, and more when learning something new in the platform. Using this while delivering features ensured that knowledge was distributed across the team, and it also kept a constant channel of communication open between distributed team members. Old-fashioned video conferencing and screen sharing was sufficient at the time, but we later explored tmux configurations for shared command line access to machines. Anything beyond that was a struggle as far as pair programming tooling goes, as the environment was too locked down to allow the Live Share functionality of VSCode or anything similar.
  • It was important for us to ensure that everything we did was repeatable, so every change we wanted to make, whether a configuration change, a build change, or deploying new servers, was codified first. We mainly used Ansible playbooks or Jenkins pipelines and followed the everything-as-code practice. We used git, which made our code versionable, and when we released a new stable version of the platform we tagged it to mark that point (a minimal tagging sketch follows this list). We could always revert to a working version. This helped us a few times, especially at the beginning, when we needed to spin up a new cluster very quickly to test new functionality.
  • We agreed on a set of rules we'd all abide by, including core hours of work, remote-scrum times, and potentially a sense of humour. We wrote our social contract, signed it, and transcribed it to our wiki. This gave us an understanding of when and how it was best to collaborate, even with our different cultural backgrounds and timezones.
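The tagging workflow mentioned above is nothing more exotic than annotated git tags; a sketch (the version number is illustrative):

git tag -a v1.4.0 -m "platform release: cluster config and pipelines known good"
git push origin v1.4.0
# if a later change misbehaves, we could always check out the last good tag
git checkout v1.4.0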

I've seen a few of these deployments in the past, and one of the main success or failure criteria I have seen is developer and business engagement. Therefore, it was important to ensure that developers were engaged as much as possible to use this platform to test and develop.

Tweaks

The initial set of practices we used to collaborate worked quite well, but needed a few modifications and additions. Below are the things that we changed or would have liked to change:

  • OpenShift, and Kubernetes in general, is a fast-moving platform; while learning about all the new components, integrations, and modifications, it was important to educate our users too. We set up time during our days to absorb new material from the community by reading blog posts, following tutorials, and adapting some of it for our users' consumption. This is something we then added to our social contract.
  • Empathy mapping and user interviews to increase user engagement were something we were all interested in, and they were a key factor in getting the platform moving forward. We wanted to ensure that new users of any container technology would first try, and ideally succeed, with OpenShift. We came up with a list of teams that were aiming to create cloud-native workloads or could benefit from modernisation, and a list of questions to understand their current development frustrations and constraints. This gave us a direct line to our main users once we started enabling features on the platform.
  • Principles such as everything-as-code are great 80% of the time, for everything that is well understood. However, there was a good 20% where the value of automating something needed to be proven by first testing that a change worked manually. This gap was later minimised by introducing automated tests as part of building a new cluster, which told us whether our changes were sane and worked.
  • Not all scrum events worked well in this distributed team. Our daily standup ended up as a debugging session more often than not. Although this was useful, I feel we were missing the point and could have focused our time a bit better. I understand why, too; the setting was exactly the same, as we were on a video call with each other all day. It improved a little after one suggestion: to actually stand up during the call. However, I feel it would have been much easier to have a scrum master to enable us.
  • Visualising our workload was something that we had to do using digital tools like wikis and digital kanban boards. However, having a physical copy of our social contract and actual boards to write on would have helped massively in re-focusing every time we looked around or went for a coffee. Space wouldn't allow us to do that, but I believe it would have brought even better results.

Next Time

These are the things I would do differently next time.

I would love to bring that initial collaboration meeting forward by a few weeks. It was the catalyst for working better together. It created connections, and a lot of trust, that are so difficult to forge over the phone or video conferencing.

Product owner involvement was not as high as I expected and was delegated to the team. Although this was good in that it gave us more power, creating the initial connections to the developers was slow and frustrating. If I were to do this again, I would stress even more how important the product owner's time with the team and the developers is.

Takeaways

So far, with the practices used above, I've seen not only a successful deployment and use of OpenShift but also a clear shift in how people talk to each other and want to contribute to this project. Whether it's a small company or a global supertanker of an organisation, everyone wants to improve their ways of working. That was keenly felt here. These practices are easy to try, but they take discipline and good humour to continue, especially in the context of widely distributed teams. I would thoroughly recommend trying them and reporting back on which ones you picked and why.

References

If you want to learn more about practices used in this blog post visit https://openpracticelibrary.com.

If you are interested in OpenShift and learning more about the technology visit https://docs.openshift.com.

If you are interested in automation around self-created IaaS and OpenShift, follow the CASL project. This was used as an upstream OpenShift deployment tool with pre- and post-installation hooks, and upstream changes were made to ensure the community would be able to work around the customer's required changes.

PAM – Pluggable Authentication Modules for Linux and how to edit the defaults

Most of us have been using PAM when authenticating without really thinking about it, but for the few of us that have actually tried to make sense of it, PAM is the partner that always says “no”, unless otherwise stated. It’s the bane of any sysadmin’s existence when it comes to making system x secure, and it becomes a major pain point on and off when I forget about the normal rules of engagement.

Rules of Engagement

Session windows

To engage with PAM in any combative situation, please ensure belts and braces are on, and keep your arms and legs inside the vehicle at all times. Back up the /etc/pam.d/ directory, and make sure that you have one or two non-terminating sessions open on your system – ideally a console and an ssh session.

The unexpected overwrite

On RHEL/Fedora-like systems, PAM is configured with two main files which are included by the rest of the PAM configuration: /etc/pam.d/system-auth-ac and /etc/pam.d/password-auth-ac. Normally, system-auth and password-auth in the same /etc/pam.d directory are links to the above files. The authconfig tools will overwrite the configuration in the files with the -ac suffix. This means that if your changes need to be persistent and not overwritten, the symlinks can be pointed at a new location, as follows:

ls -l /etc/pam.d/*-auth

lrwxrwxrwx. 1 root root 19 Feb 19 12:57 /etc/pam.d/fingerprint-auth -> fingerprint-auth-ac
lrwxrwxrwx. 1 root root 16 Feb 19 12:57 /etc/pam.d/password-auth -> password-auth-ac
lrwxrwxrwx. 1 root root 17 Feb 19 12:57 /etc/pam.d/smartcard-auth -> smartcard-auth-ac
lrwxrwxrwx. 1 root root 14 Feb 19 12:57 /etc/pam.d/system-auth -> system-auth-ac
[root@node1 pam.d]# rm password-auth
rm: remove symbolic link ‘password-auth’? y
[root@node1 pam.d]# ln -s password-auth-local password-auth

The Language

When the usual PAM translator (authconfig) is not enough to achieve the right system authentication, one has to start thinking about communicating directly with it. PAM has four different keywords for controlling authentication with the system:

awk '!/^[#\-]/{print $1}' /etc/pam.d/*  | sort | uniq

account
auth
password
session

auth is used for providing some kind of challenge/response (depending on the module)  – usually username/password.

account is used to time or otherwise restrict the user account, i.e. the user must use faillock, load all the sssd account requirements, etc.

password is used to update the authtoken associated with the user account. This is mainly used to change passwords and it can be where the rules around local password strength can be formulated.

session is used to determine what the user needs before they are allowed a session: a working home directory, which system limits apply (open file handles, number of terminals, etc.), and the user keyring.

Each of these keywords has these common modes:

required - if this fails return failure but continue executing anyway
[success=ok new_authtok_reqd=ok ignore=ignore default=bad]

requisite - if this fails return failure and die (don't even attempt to preauth)
[success=ok new_authtok_reqd=ok ignore=ignore default=die]

sufficient - this is enough for success and exit if nothing previously has failed
[success=done new_authtok_reqd=done default=ignore]

optional - we don't care unless this is the only module in the stack associated with this type
[success=ok new_authtok_reqd=ok default=ignore]

PAM execution stack

PAM executes everything sequentially unless told otherwise. Take the following snippet of password-auth-ac:

# set environment variables
auth required pam_env.so
 
# delay (ms) by this amount if last time this user failed
# otherwise if absent check login.defs for specified delay
auth required pam_faildelay.so delay=2000000
 
# check user authentication from the system 
# try_first_pass - use the password that has already been entered if any
# nullok - allow a blank password
auth sufficient pam_unix.so nullok try_first_pass
 
# succeed only if uid is greater than or equal to 1000 (requisite: fail immediately otherwise)
# quiet_success - don't log success to the system log
auth requisite pam_succeed_if.so uid >= 1000 quiet_success
 
# deny everything else
auth required pam_deny.so

If a sufficient module succeeds (and nothing required has failed before it), the execution stack exits for that component – i.e. auth is successful when a local user signs in with a username/password.

PAM if statements

PAM doesn’t have very obvious if statements, but given the right parameters it allows jumps of execution. Below is a way of incorporating an SSSD back-end with PAM to allow users with IdM logins access to the system:

# check with faillock (preauth) whether the user is already locked out before authenticating
auth        required      pam_faillock.so preauth silent audit deny=5 unlock_time=900
 
# skip the next two rules if this succeeds
# NOTE: this behaves like 'sufficient', except that success jumps two rules instead of ending the stack
# check if it's a unix user - use the password provided by the auth stack
auth        [success=2 new_authtok_reqd=done default=ignore] pam_unix.so try_first_pass
 
# if it's not a unix user, then use sssd backend for logging in
auth        sufficient   pam_sss.so forward_pass
 
# otherwise fail 
auth        [default=die] pam_faillock.so authfail audit deny=5 unlock_time=900
 
# this is the skip step from pam_unix module
# it allows for resetting the faillock when necessary
auth        sufficient    pam_faillock.so authsucc audit deny=5 unlock_time=900

The comments in line explain what each module is doing. The execution sequence reminds me a bit of jump statements in assembly language and it helps me to think about them in that manner.

Putting them all together gives us this auth section:

auth        required      pam_env.so
auth        required      pam_faildelay.so delay=2000000
auth        required      pam_faillock.so preauth silent audit deny=5 unlock_time=900
auth        [success=2 new_authtok_reqd=done default=ignore] pam_unix.so try_first_pass 
auth        sufficient pam_sss.so forward_pass
auth        [default=die] pam_faillock.so authfail audit deny=5 unlock_time=900 
auth        sufficient pam_faillock.so authsucc audit deny=5 unlock_time=900
auth        required      pam_deny.so

If we were to make slight adjustments to the above snippet, it may have the frightening effect of allowing users to log in without having the correct password:

auth        required      pam_env.so
auth        required      pam_faildelay.so delay=2000000
auth        required      pam_faillock.so preauth silent audit deny=5 unlock_time=900
# reducing this number from 2 to 1 (success=1)
auth        [success=1 new_authtok_reqd=done default=ignore] pam_unix.so try_first_pass
auth        sufficient pam_sss.so forward_pass
# swapping these two lines
auth        sufficient pam_faillock.so authsucc audit deny=5 unlock_time=900
auth        [default=die] pam_faillock.so authfail audit deny=5 unlock_time=900
auth        required      pam_deny.so

Conclusion

PAM is a very powerful, yet quite obscure tool. It can be configured to allow people in without even a valid password, or it can deny everyone access apart from every alternate Tuesday between 19:00 and 20:00 (in combination with other tools). Whenever I have configured it, I have found it useful to test for access allowed, access denied, and access locked in order to ensure predictable operation.

NOTE: To get more information about PAM, visit the man pages: pam.conf(5), pam(8), and password-auth(5).

Identity Management (Continued) – The AD integration

Thoughts about AD integration with IPA

Have you ever spent days wondering what would happen if the Windows guys talked to the Linux guys, and vice versa? It's odd: organisations decide to manage identity with Active Directory, and thus the Linux server estates come in as an afterthought, an add-on that needs to be integrated, and maintained, alongside the Windows boxes.

Centralised authentication is one of those integration points that someone like me has to address. A preexisting Active Directory is usually (and regrettably) the cornerstone for this, as it provides authentication to the desktop estate where end users usually remain, for better or for worse. Introducing central authentication for Linux servers and users is something wonderful. Something to be celebrated with a party! No longer would you manage individual users and passwords, no longer would you have to go through individual servers deleting previous sysadmins that have pained and upset you by leaving you in this mess, and no longer would you have the arduous task of clearing up after them. You can remove them from your centrally managed Linux authentication once and for all.

In fact, you can do one better than that. If your company is managing users via AD, and you are integrating AD with IPA you have the possibility to offload this task to the Windows team. They can go and disable the user from AD. All you have to do is sit back and relax.

The Integration Choice

I know my thoughts wandered at this point. All the different possibilities. How best to integrate, what are the benefits of the different options, why would people care. Decision time!

Assuming that you already decided on using IPA and have one up and running and ready to go, you will want to consider the following two possibilities:

  • the user sync from AD to IPA
  • the cross realm trust

User synchronization from AD to IPA may leave you with username collisions if you have similarly named users. It also doesn't allow SSO or unique identities, but it does give you the option of having two different passwords in the Windows and Linux realms; some may consider that a plus. On the down side, you also have to manage group membership yourself, because all the users are synchronized but the AD groups are not.

I prefer the cross realm trust route. It means that user management falls completely under the S.E.P. field. All I need to care about is which groups of users the company trusts enough not to harm themselves and others when using Linux. Each realm is responsible for authenticating its own users, and the groups that the users are in allow them to be authorised for certain actions.

NOTE: If you don't have an IPA server ready to go, you have very limited time, you're not allowed to run two different realms, or you don't care about policy enforcement, consider integrating directly with AD. SSSD and Kerberos should help you manage AD users, with your Linux boxes talking directly to the AD forest.
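A direct-integration sketch with realmd (package names as for RHEL 7; the domain is the windows.local lab used later in this post):

yum -y install realmd sssd adcli oddjob oddjob-mkhomedir samba-common-tools
realm discover windows.local                 # check the domain is reachable and what joining requires
realm join --user=Administrator windows.local
id administrator@windows.local               # verify that AD users now resolve via sssd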

The rocky road

I already have an installation that I can use for my integration. The previous article showed how to disable most IPA features. This part shows how to live with your choices, using IPA 4.1 and AD 2012R2.
IPA needs a NetBIOS name, as well as the samba and winbind services, configured:

ipa-adtrust-install --netbios-name=MGMT

Enable the following ports:

 firewall-cmd --permanent --add-port={138/tcp,139/tcp,445/tcp,138/udp,139/udp,389/udp,445/udp}
 firewall-cmd --reload

Add the service records to DNS. This is one of the things that allows AD to talk to IPA on a native level:

_ldap._tcp.dc._msdcs IN SRV 0 100 389 idm
_kerberos._tcp.dc._msdcs IN SRV 0 100 88 idm
_kerberos._udp.dc._msdcs IN SRV 0 100 88 idm
_ldap._tcp.Default-First-Site-Name._sites.dc._msdcs IN SRV 0 100 389 idm
_kerberos._tcp.Default-First-Site-Name._sites.dc._msdcs IN SRV 0 100 88 idm
_kerberos._udp.Default-First-Site-Name._sites.dc._msdcs IN SRV 0 100 88 idm

And create the trust (yes, in IPA 4.1 you do have to stop the firewall and then add the trust; there are ports that aren't mentioned in the docs):

systemctl stop firewalld
ipa trust-add --type=ad "windows.local" --admin Administrator --password
systemctl start firewalld
ipactl status
reboot

Check that the trust exists by finding the domains

[root@idm ~]# ipa trust-find windows.local
---------------
1 trust matched
---------------
  Realm name: windows.local
  Domain NetBIOS name: WINDOWS
  Domain Security Identifier: S-1-5-21-4218785893-350090421-2374357632
  Trust type: Active Directory domain
----------------------------
Number of entries returned 1
----------------------------

OK, now one kerberos realm trusts the other and this is how it should look:

Cross Realm Trust

The group mapping

To enable your AD users to log on to clients of IPA, you need to tell IPA how to find them. To do that, you will need to add the AD group containing your users as an external group in IPA. Let's assume you want to allow the AD Domain Admins group to ssh onto servers.

ipa group-add --desc='AD Domain admins' domain_admins_external --external
ipa group-add-member domain_admins_external --external 'WINDOWS\Domain Admins'

These groups won’t have any POSIX attributes as they are external groups. Every user that connects to a Linux computer requires a Linux UID and GID. This means that you will need to create another group to contain the mapped external one:

ipa group-add --desc='AD Posix Domain Admins' domain_admins
ipa group-add-member domain_admins --groups domain_admins_external

Great! You can now start configuring your sudo, host based access control and Identity Views!

Identity Management (IPA) – The ‘on the side’ installation

Thoughts about IPA installation

IPA or IdM in its Red Hat productised form is a very neat product. It allows centralised authentication and policy management while providing that over secure channels (kerberos and TLS). IdM provides quite a few features and you may decide that you’re better off without some (saving the extra calories/effort for later) as your infrastructure may already provide those on the side.

This example installation is without DNS, without a CA, and without NTP (VM installations shouldn’t really be running NTP anyway).

Once you’re past the stage of convincing management that it’ll be good for you (and good for them) to allow this installation to happen, this is what you need to think about and discuss with the team managing Certificate Authorities, NTP servers, and DNS:

  • DNS – DNS zones need to be configured in such a way that IPA acts as a KDC for its own group of servers. If there are existing KDCs in a different realm in the environment, they will need to be in a different subdomain/domain. The SRV records will then only return the IPA servers when queried about Kerberos in this subdomain.
  • Certificates – IPA uses SSL for LDAP and HTTP. IPA could act as a Certificate Authority, but not in this instance. Active Directory (or something else) may already be configured as a Certificate Authority, which would allow you to present your Windows team with a certificate request from IPA to sign in order to obtain a valid web certificate.
  • Time – a uniform time source across the estate of IPA servers and clients. Think about business meetings, SSL, sex, humour, and trains – all require good timing.

NOTE: IPA/IdM used to have to provide certificates by default to its clients on installation. As this is no longer the case, IPA can be installed without a CA in an easier fashion than you’re used to. Give it a try.

Prerequisite Checking for IPA installation

NOTE: The following installation is for IPA version 4.1 and AD version 2012R2.

Check that you have:

  • Access to the right software packages via yum (normal RHEL/CentOS base repo should do)
  • Forward and reverse resolvable hostname
  • An entry in /etc/hosts with the IP address and hostname
  • nscd off
  • An up-to-date OS installation
  • Still got your sanity (test to be performed by an external third party)

The certificate creation

Create your secret private key for your server. Here, we are using openssl to generate a private key:

mkdir /root/certs 
openssl genrsa -out /root/certs/http.$(hostname).key 2048

And then, the certificate request below:

[root@idm certs]# openssl req  \
-key /root/certs/http.idm.mgmt.linux.local.key \
-out /root/certs/$(hostname -f).csr -new

You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [XX]:GB
State or Province Name (full name) []:Norfolk
Locality Name (eg, city) [Default City]:Norwich
Organization Name (eg, company) [Default Company Ltd]:MGMT.LINUX.LOCAL
Organizational Unit Name (eg, section) []:She ITs And Giggles
Common Name (eg, your name or your server's hostname) []:idm.mgmt.linux.local  
Email Address []:root@localhost 

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:SheITsAndGiggles
An optional company name []:

Your AD Certificate Authority should be able to sign this CSR; you can then retrieve the CA chain along with the certificate for the IPA installation. An example of how to sign certificates with a Windows CA can be found here: Generate a Digital Certificate from CSR.

To examine the certificates, open them as follows:

openssl x509 -in <certificate> -text

The Installation

The following step just downloads the software on the server. It doesn’t start any services:

yum -y install ipa-server ipa-server-trust-ad

Now for the fun part:

ipa-server-install --http-cert-file /root/certs/http.idm.mgmt.linux.local.crt \
--http-cert-file /root/certs/http.idm.mgmt.linux.local.key \
--http-pin SheITsAndGiggles \
--dirsrv-cert-file /root/certs/http.idm.mgmt.linux.local.crt \
--dirsrv-cert-file /root/certs/http.idm.mgmt.linux.local.key \
--dirsrv-pin SheITsAndGiggles --ca-cert-file ca-chain.p7b \
 -n mgmt.linux.local -r LINUX.LOCAL --mkhomedir

NOTE: I have used the same certificate and key for the http and directory servers. The p7b file that has been downloaded from the CA is the chain.

After the installation, you will need to open all the ports for the services that we are running and add some DNS entries to advertise those services:

systemctl enable firewalld
firewall-cmd --permanent --zone=public \
--add-port={80/tcp,443/tcp,389/tcp,636/tcp,88/tcp,464/tcp,88/udp,464/udp}
firewall-cmd --reload

To tell your IPA clients what you are serving you need to advertise the services via DNS. Find an example below:

_ldap._tcp IN SRV 0 100 389 idm
_ldap._udp IN SRV 0 100 389 idm
_kerberos IN TXT MGMT.LINUX.LOCAL
_kerberos._tcp IN SRV 0 100 88 idm
_kerberos._udp IN SRV 0 100 88 idm
_kerberos-master._tcp IN SRV 0 100 88 idm
_kerberos-master._udp IN SRV 0 100 88 idm
_kpasswd._tcp IN SRV 0 100 464 idm
_kpasswd._udp IN SRV 0 100 464 idm

Check that the installation is running as it should by getting Kerberos credentials for your admin user and using them to ssh to the IPA server:

[root@idm ~]# kinit admin
[root@idm ~]# ssh  admin@$(hostname -f)
Creating home directory for admin.