DevConf CZ 2020

This year, like every year, I find myself wondering which conferences I should go to. DevConf is definitely high on the list: its content is mostly built around the upstream projects of Red Hat products, the people who attend are open source contributors, the Czech Republic location makes it one of the cheaper conferences around, and January is not so crazily busy for me that I can't attend.

The content

DevConf has several streams of interesting content which can be found here. I dedicated the majority of my limited time to learning more about containers as well as AI/ML. The ones I found most interesting, however, were the more obscure lectures I chose to attend because of my friends' choices. These were lectures on a few non-mainstream subjects that keep me current in the world of open source and Red Hat.

Friday was the most informative day for me, as I had to travel for most of Sunday and Saturday seemed to morph into a networking session.

Lessons learned

These are the things I want to look up and dive into when I have a few moments of spare time:

As part of the SRE talks: configuring things like quotas and limits is generally well known, but Pod Disruption Budgets are something that would reward good OCP tenants with a more stable platform. They allow safety constraints to be specified for pods during operations such as draining a node for maintenance: https://docs.openshift.com/container-platform/4.2/nodes/pods/nodes-pods-configuring.html. Environments with more sophisticated uptime requirements (e.g. nighttime shutdown of most nodes) would benefit from this.
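
To remind myself what one looks like, here is a minimal sketch; the namespace, app label, minAvailable value, and the policy/v1beta1 API version current around OCP 4.2 are all assumptions on my part:

oc apply -n my-app -f - <<EOF
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2          # keep at least 2 pods running during voluntary disruptions such as node drains
  selector:
    matchLabels:
      app: my-app
EOF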

Tools exist that check manifests or deployment configs in order to educate developers and operations team members about what a good application deployment looks like, e.g. https://github.com/app-sre/manifest-bouncer. This is something that can benefit communities of operations people who want to improve their SLAs and deployments in order to get the best out of a platform. Ideally, the more robustly a manifest or template is written, the more reliable the application it deploys.

Peeking into your compiler was not quite as big a subject as some of the ones above, but it was nevertheless an interesting session with Ulrich Drepper and Jakub Jelínek. It compared GCC with other compilers (GCC emits assembly as a text file, supports multiple languages with separate front and back ends, and targets different architectures), then delved into the specifics of how each individual compiler component can be examined. Even though I don't think I'll be using anything from this talk soon, I was reminded of the complexity and customisability of not only compilers but open source in general. This is both a benefit and a curse: people need to put in the effort to gain a benefit, but usually the benefits outweigh the effort.

CodeReady Containers (CRC) talks are for developers who want to test their code on an OpenShift-like environment, finally replacing Minishift for v4 with something remotely workable. Bugs such as the 30-day certificate expiry of CRC have been fixed, which enables long-running instances of it. Integrations with Windows and macOS have also been released to enable developers and organisations to use it effectively.

Performance tuning is an interesting subject from many perspectives, from compilers, as mentioned earlier, to applications. PBench is a tool that caters to a simple use case: ensuring that whatever environment a program runs on, the same suite of tools will gather information about it. Well-known tools such as libvirt, kernel configuration and sos report, and less well-known ones such as block, stockpile and ara, are used to compile a holistic view of what the application needs. It would be interesting to see how well it integrates with OpenShift or other container environments, and indeed with AWS Lambda functions and serverless as that comes to saturate the market, but that may be a long way away for this tool.

Artificial Intelligence, and in particular Machine Learning, is a subject with a lot to give. Talks around data classification and pruning were particularly informative in finding the right balance between the right output and a faster system. Compromises were made in achieving the targeted outcome by pruning data that was correlated (e.g. property age and foundation quality are correlated, so only one should be kept) in such a way that the remaining non-correlated data gives the best outcome, a very rigorous application of Occam's Razor. AI/ML is a subject I'm still very new to but interested in learning, as I've always wanted to design the Matrix and not be part of it.

How to’s and getting started with…

I generally feel that any 30-minute talk during this conference was not worth my time. I think some people would have found them mildly informative, but in actuality 20 minutes of talking and 5 minutes of questions is too little to gain a deeper understanding of a complex topic.

However, I have gathered some information on “How to get started with Operators”. Operators can be created using three methods: Ansible, Golang, and Helm charts. Operator basics can be investigated through this codeshare: https://github.com/cloudflightio/operator-basics.
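
As a sketch of what scaffolding each type looked like with the operator-sdk CLI of that era (v0.x syntax; the project names, API versions, and kinds are placeholders, and newer SDK releases use 'operator-sdk init' instead):

# Ansible-based operator skeleton
operator-sdk new memcached-operator --type=ansible --api-version=cache.example.com/v1alpha1 --kind=Memcached

# Helm-based operator skeleton
operator-sdk new nginx-operator --type=helm --api-version=apps.example.com/v1alpha1 --kind=Nginx

# Go-based operator skeleton (the default type)
operator-sdk new app-operator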

A topic that I also found could be greatly expanded is managing keys and key servers for network-bound disk encryption. Rotation of keys can happen using features from tools such as tang, clevis, and sss.
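
For my own notes, a sketch of what key binding looks like with clevis (device paths and tang URLs are placeholders):

# Bind an existing LUKS device to a tang key server (prompts for the LUKS passphrase)
clevis luks bind -d /dev/sda2 tang '{"url": "http://tang.example.com"}'

# Shamir secret sharing (sss) can combine pins, e.g. require any 1 of 2 tang servers
clevis luks bind -d /dev/sda2 sss '{"t": 1, "pins": {"tang": [{"url": "http://tang1.example.com"}, {"url": "http://tang2.example.com"}]}}'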

Words

If anyone wants to play buzzword bingo (it’s a real thing) when the next technology adoption phase arrives, these are the words to put on your card.

GitOps was a very prominent word. It was coupled with managing OpenShift or K8s clusters (with ArgoCD as the main example) and having transparent and repeatable processes (heavily featuring operators).

Multi-Cloud was another one. This relates to the ability to switch clouds at will and avoid vendor lock-in, by using platforms such as K8s to deploy containers, and technologies such as Ceph and NooBaa to provide storage on cloud providers with async replication between them. The talks dove into application use cases for this multi-cloud proposal, such as re-hosting (lift and shift), re-platforming (keeping the existing application and building new capabilities with containers), and refactoring (re-writing, with a lot of effort, from monolithic to microservices architectures).

Container Security: this was a topic featured by none other than Dan Walsh and his team. Moving away from root in containers is explained at https://opensource.com/article/18/3/just-say-no-root-containers. Further talks covered dropping capabilities from the container runtime and generating SELinux policies for containers automatically using Udica.
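
A quick sketch of the capability-dropping idea with podman (the image name and the single capability added back are illustrative):

# Drop every capability, then add back only what the workload actually needs
podman run --rm --cap-drop=ALL --cap-add=NET_BIND_SERVICE -p 8080:80 registry.example.com/my-app:latest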

Self and time Management

Conferences are always a tiring thing for me. It’s a balancing act of finding things that are worthwhile and worth spending mental energy on. Developer conferences in particular have the additional bonus of extended time with people who are far cleverer and more accurate than I am. Anyone with impostor syndrome, or on the verge of it, can testify that although this is eye-opening and informative, it’s not boosting anyone’s ego. Therefore, time to decompress and frequent breaks are important. I find that at least 3 longer breaks are needed, approximately one every 2.5 hours, to break the day into manageable chunks.

Fortunately, non-techie booths, lectures, and people were plentiful. I made my break time more interesting by looking at funky stickers, talking about neurodiversity, and talking about how a bee's or a hive's vision could be used as an example for solving complex problems. However, sometimes that doesn’t cut it. Sometimes you need a quiet table or a badge that says “don’t talk to me”. That was available at some conference booths, for which I felt extremely grateful. I may not have used it, but it was there had I needed it.

Things that I’ve missed and would like to see:

I found that, at least on Friday, I wanted to attend at least 3 lectures that were on at the same time, and on Sunday there were lectures I couldn’t attend due to the travel time I had budgeted. There should be recordings and slides of these lectures, but that still doesn’t beat being there in person to ask questions and network. These are the ones I would have liked to attend but couldn’t:

Will you do it again?

Yes. Despite the fact that DevConf is swimming in Red Hat upstream content, it has a lot to offer the ever-learning consultant in me. I would, however, supplement it with something that has a more rounded view of open source next year. Any suggestions are welcome.

OpenShift – From Design and Deploy to Deliver and Transform: Optimising Distributed Teams with Agile Practices

Overview

Frequently when I’m on site I am not directly asked, but am nevertheless expected, to provide my customers with answers on how to get the best use out of a technology. In this post I’m examining a recent scenario: providing structure around an OpenShift deployment in order to create a collaboration environment that would aid the use of the technology. We were also deploying OpenShift itself, but OpenShift deployment is a well-covered subject across the board.

Background

I recently visited a customer that wished to containerise the world and provide their developer community with a Container as a Service (CaaS) offering – a single enterprise Kubernetes cluster that would allow groups of developers to develop and deploy – as well as an Enterprise Kubernetes cluster as a Service (KaaS) offering – a series of clusters that could be ordered on demand by different management chains and in different security groups. Although I think the first one is easy to do and would fit many use cases, the second is definitely more complex; big vendors and service companies still struggle to update and maintain multiple Kubernetes clusters, especially when those clusters have massively different configurations.

When I first went on site, I realised that I was in London and my primary contacts were working remotely. This is quite uncommon for consulting engagements, but it was a common theme for the organisation I was working with: distributed teams with minimal travel budgets. I needed to pick my battles as to what I could change, so I set course to meet my primary contacts in a central European city that suited them, in order to run a series of workshops that would help us agree on ways of working, tools, technologies, architecture, and so on. Even though I had been working on this project remotely for a few weeks, this was a major breakthrough for the pace of work and a highly effective way of getting to know and trust each other. Other than time and experience in the field, a few techniques that I used played a major role in that too.

Using Open Practice Library practices in a distributed team

At the time, I had recently finished a precursor of the current DevOps Culture and Practice Enablement (DO500) course and I was eager to put what I had learned into practice. These methods, I find, are consistently effective at bringing people together to talk about the right things.

When I arrived at the mutually agreed location, I was given a list of objectives to help the organisation deploy OpenShift Container Platform (OCP) as a service. We started by discussing why we were trying to achieve what we were trying to achieve, and what success would look like, using a Start At The End method. This was very useful in giving us context and direction, as we wanted to make sure the business would get the most out of this. It made us focus on the end goal: user (developer) satisfaction through seamless integration with current customer systems, ease of testability, and engagement with the product.

We then agreed on a list of things we would keep doing to make sure that collaboration and focus didn’t wane; we built our foundation:

  • We decided to use pair programming techniques: two people delivering a feature, and more when learning something new in the platform. Using this while delivering features ensured that knowledge was distributed across the team, and it kept a constant channel of communication open between distributed team members. Old-fashioned video conferencing and screen sharing was sufficient at the time, though we later explored tmux configuration for shared command-line access to machines. Anything beyond that was a struggle as far as pair programming tooling goes, as the environment was locked down too tightly to allow the Live Share functionality of VSCode or something similar.
  • It was important for us to ensure that everything we did was repeatable, so every change we wanted to make, whether a configuration change, a build change, or deploying new servers, was codified first. We mainly used Ansible playbooks or Jenkins pipelines and followed the everything-as-code practice. We used git, which made our code versionable, and when we released a new stable version of the platform we tagged it to mark that point, so we could always revert to a working version (see the sketch after this list). This helped us a few times, especially at the beginning, when we needed to spin up a new cluster very quickly to test new functionality.
  • We agreed on a set of rules we’d all abide by, including core hours of work, remote scrum times, and ideally a sense of humour. We wrote our social contract, signed it, and transcribed it to our wiki. This gave us an understanding of when and how it was best to collaborate, even with our different cultural backgrounds and timezones.
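
A minimal sketch of the tagging flow mentioned above (tag names and messages are illustrative):

# Tag the commit that produced a known-good platform build
git tag -a platform-v1.4 -m "Stable platform release"
git push origin platform-v1.4

# Later, roll back to that known-good state if a change breaks the build
git checkout platform-v1.4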

I’ve seen a few of these deployments in the past, and one of the main success-or-failure criteria I have observed is developer and business engagement. Therefore, it was important to ensure that developers were engaged as much as possible, using the platform to test and develop.

Tweaks

From the initial set of practices we used to collaborate, we found that most worked quite well but needed a few modifications and additions. Below are the things we changed or would have liked to change:

  • OpenShift, and Kubernetes in general, is a fast-moving platform; while learning about all the new components, integrations and modifications, it was important to educate our users too. We set aside time during our days to absorb new material from the community by reading blog posts, following tutorials, and adapting some of it for our users' consumption. This is something we then added to our social contract.
  • Empathy mapping and user interviews for increasing user engagement were something we were all interested in, and they were a key factor in moving the platform forward. We wanted to ensure that new users of any container technology would first try, and ideally succeed, with OpenShift. We came up with a list of teams that were aiming to create cloud-native workloads or could benefit from modernisation, and a list of questions to understand their current development frustrations and constraints. This gave us a direct line to our main users once we started enabling features on the platform.
  • Using principles such as everything-as-code is great 80% of the time, for everything that is well understood. However, there was a good 20% where the value of automating something needed to be proven by first testing the change manually. This gap was later minimised by introducing automated tests as part of building a new cluster, which told us whether our changes were sane and worked.
  • Not all scrum events worked well in this distributed team. Our daily standup ended up as a debugging session more often than not. Although this was useful, I feel we were missing the point of focusing our time a bit better. I understand why, too; the setting was exactly the same, since we were on a video call to each other all day. It improved a little after one suggestion: to actually stand up during the call. However, I feel it would have been much easier with a scrum master to enable us.
  • Visualising our workload was something we had to do with digital tools like wikis and digital kanban boards. However, having a physical copy of our social contract and actual boards to write on would have helped massively in re-focusing every time we looked around or went for a coffee. Space didn’t allow us to do that, but I believe it would have brought even better results.

Next Time

These are the things I would do differently next time.

I would love to have held that initial collaboration meeting a few weeks earlier. It was the catalyst for working better together, and it created connections, and a lot of trust, that are so difficult to forge over the phone or video conferencing.

Product owner involvement was not as high as I expected and was delegated to the team. Although this gave us more power, creating the initial connections to the developers was slow and frustrating. If I were to do this again I would stress even more how important the product owner's time with the team and the developers would be.

Takeaways

So far, with the practices described above, I’ve seen not only a successful deployment and use of OpenShift but a clear shift in how people talk to each other and want to contribute to the project. Whether it’s a small company or a global supertanker of an organisation, everyone wants to improve their ways of working, and that was keenly felt here. These practices are easy to try, but they take discipline and good humour to keep up, especially in the context of widely distributed teams. I would thoroughly recommend trying them and reporting back on which ones you picked and why.

References

If you want to learn more about practices used in this blog post visit https://openpracticelibrary.com.

If you are interested in OpenShift and learning more about the technology visit https://docs.openshift.com.

If you are interested in automation around self-created IaaS and OpenShift, follow the CASL project. This was used as an upstream OpenShift deployment tool with pre- and post-installation hooks, and upstream changes were made to ensure the community would be able to work around the customer’s required changes.

The 3 Laws of Consulting

Lessons from my first 10 years of consulting in IT

Introduction

Immanuel Kant, the German philosopher, said that “experience teaches nothing without theory, but theory without experience is mere intellectual play”.

So after 10 years of consulting I finally feel I have enough experience to consider the theory behind what I have been doing, learn the lessons it has taught me, and draw some real-world conclusions from the theoretical framework I might have been subconsciously applying.

There are many books about consulting, but most assume you are either a management consultant looking to inspire organisational change, or an independent contractor looking to maximise revenue and sell your skills. However, each of these genres has lessons to teach.

One of the best, or at least most accessible, theoretical frameworks I found was Gerald Weinberg’s “The Secrets of Consulting”, which presents his three laws:

  • There’s always a problem
  • It’s always a people problem 
  • Never forget they’re paying you by the hour.

These seem simplistic and not a little cynical, but when you think carefully about them there is surprising depth to each.

1. There is always a problem.

This is probably the most obvious but also the most fundamental rule. There is always a problem. After all, a customer is spending a substantial amount of time and money to have a consultant come in and work with them; if there wasn’t an issue of some sort that they thought a consultant could solve, they would spend that money elsewhere. No problem, no consultancy needed.

The most obvious IT consulting problem is to install and configure a piece of software, or maybe even to write it from scratch. The customer has decided they need some functionality, and they need it enough to pay for a consultant to make sure it is set up correctly, according to best practices. 

At the other end of the scale is the consulting engagement born out of chaos and disaster, sometimes fixing breakages, fighting fires, sometimes trying to drag their systems and practices kicking and screaming into the 21st century.

Many stages exist between these two extremes but there is always something they need, a problem to be solved, that they cannot provide themselves.

1.1 Stating the problem

In any well organised consulting engagement there should be a scope of works agreed upon between the consultant and the customer. This makes up part of the contractual agreement between the two, a legal agreement that should stand up in court if required.

The scope of works (or SOW) provides the description of all the problems the consultant should solve. Work outside the SOW can prove a major issue for a successful engagement, as it always eats into the time allowed. It can also open the consultant and their employer up to legal issues: an uncompleted contract, and even the coverage of liability insurance (which is often based upon the contract), are both potentially highly costly legal rabbit holes that no one should ever want to venture down.

In any engagement the predefined statement of the problem contained in the scope of works should provide the happy path, and venturing from it should be attempted only with great caution and the agreement of the customer.

1.2 The problem they hire you for, the problem they think they have, and the problem they really need you to solve

Just because a customer states they have a problem, gets budget to fix it, spends a lot of time discussing it with salespeople, writes and agrees a scope of works, and then commits time and effort to help a consultant solve it, does not mean that is the problem they actually have.

This strange fact is much more common than you might expect, and the reasons are many and varied. Chief amongst them is that they have made massive assumptions about both the problem and the solution. This follows the standard pattern described in the sitcom ‘Yes Minister’ as politician’s logic: cats have four legs, my dog has four legs, therefore my dog is a cat. Or, in the fad-driven world of IT: we need something new, this is new, therefore we need it. The reality is often that there are better approaches to take.

Alternatively, the solution may fix the problem, but the problem is really only a symptom of a deeper, more fundamental issue that they either have not understood or are unwilling to face. Patching up ancient software or hardware will fix the immediate failures, but only upgrading to modern equipment will fix the underlying problem.

In both these cases a consultant might feel the need to break away from the scope of works, to have a conversation along the lines of “why are you doing this, when that looks like it might be more appropriate?”. Just be prepared to back away if the response is negative. There may be much more going on than you can see.

Changing the scope of the engagement is really negotiating a new (albeit often informal) scope of works. This needs agreement from both the customer and those who helped scope the project (the sales team and project manager at a minimum), explaining the risks and costs, and being very clear that spending time on this will mean time can’t be spent on the stated problem. It’s best to get this agreement in writing, so that later, when the discussion has been forgotten, you can still explain why the original scope of works has not been completed.

1.3 But it doesn’t do that…

Sometimes this seeming duplicity isn’t the customer’s fault: the sales team may have misunderstood the problem or the product and sold them a sub-optimal solution. Sometimes the customer knows full well what the problem really is but wants to paper over the cracks, because maybe it was their fault, or admitting the real issue would lead to a loss of face.

With mis-selling by the consultant’s own company this can be a major ethical question: do you plough on, or do you call out the issue? The best starting point is always to talk to the sales team, find out why it has happened, and agree with them on the approach to take. Without a unified approach the consultant will antagonise both sides, whereas a united front at least gives you a friend to fall back on.

When it is the customer deliberately doing the wrong thing, it calls for the greatest amount of tact and diplomacy, and, to be honest, it may not be worth even having the conversation. If they are really set on making a mistake, then it’s time to invoke the third law (they’re paying you by the hour) and accept that it’s their money to waste. It’s not fun, but you get paid either way.

1.4 The end of the affair 

In the end, if there really isn’t a problem, or you can no longer see it, it’s time to leave. When you stop learning new things, it’s time to move on.

It’s often tempting to stay on; the job is comfortable and the customer a known quantity. But at some point you stop providing value over and above a permanent employee or general contractor, and they stop providing you with interesting and relevant challenges that meet your interests and career aspirations.

2. There is always a people problem (and sometimes it’s you) 

While most consultants are engaged to deal with a technical problem, if that were all a good consultant needed, they would be ten a penny. In fact the biggest maker and breaker of consulting engagements is dealing with people and personalities, and there is always a people problem.

Consulting engagements always grow out of either a lack of skills or a lack of manpower at the customer’s end. If they could do the work themselves they wouldn’t pay someone else, so any consulting engagement requires an implicit admission of weakness on their part, and this leads to a plethora of potential issues ready to bite the unwary consultant.

2.1 Unknown unknowns.

Because the customer lacks the skills to do the work themselves, they also lack knowledge of the resources a consultant will need to do it. A consultant might be the greatest expert in their field, but only the customer knows their environment, so guidance is needed to navigate those waters: integrating with the customer environment, setting things up, investigating and changing processes, and a thousand and one other tasks.

The customer knows none of this, and so the consultant needs to expend significant time and effort tracking down the right people, getting time with them, and explaining exactly what is required and why. There is a delicate art in persuading someone who is busy with work assigned to them by their manager to drop everything and deal with a consultant who is often doing something that will make no difference to them at all.

In this there is a huge amount of scope for delays to creep in, causing a time crunch as deadlines approach. This requires diplomacy, charm, persistence, and sometimes the application of a metaphorical hammer to deal with. Knowing when to use the hammer and when to use charm is a core skill for any consultant.

2.3 People personality problems

The real people problems are less benign however.  It is almost impossible to bring a consultant into a company without treading on someone’s toes, and so there’s a whole set of problem people that may hold antagonism toward the consultant.

2.3.1 Technical responses

Amongst technical people, a very common source of issues is those who think (or know) they have the skills needed to do the job but feel they have been unfairly overlooked by a management that doesn’t understand their abilities.

This can manifest in two ways. The positive response comes from those who want to learn what secret sauce the consultant is bringing: they will often want to sit and watch, to discuss over lunch, and to ask for walkthroughs. This can be annoying when there is work to be done, but it’s also a behaviour that means there is a strong chance of the solution continuing to survive and be used beyond the consultant’s tenure.

The other common reaction is to feel threatened, as if the consultant’s presence were an insult to their abilities and might lead to them being replaced. These people may try to undermine the project by going slow and doing the minimum necessary to help. They may also attempt to show the consultant up in public and undermine faith in them, asking questions designed to catch them out, requiring very detailed technical knowledge only tangentially related to the matter at hand, to prove they are better at their job than any newcomer.

With disruptive and obstructive people sometimes it’s easier to let them have their small victory and move on, giving them the security they need to provide you with help down the road.  This takes a suppression of the consultant’s ego, but it’s not personal.

2.3.2 Management responses

Managers fall into two camps. There are those who were involved in bringing the consultant in; they are supportive and want a successful engagement to prove to their peers that they made a good decision. This can boil over into unrealistic expectations: thinking that because the consultant is an expert in the field they therefore know everything, can perform any task in 15 seconds flat at the drop of a hat, and will produce months’ worth of value for weeks’ worth of cost.

The second camp are those who are hostile; rarely openly so, since being seen to oppose a successful project reflects badly on them, but they often think the money could have been better spent following their own proposed solution and growing their empire and prestige.

2.4 Honesty

Both management and technical resistance are best dealt with early; the longer they go on, the more they fester and grow. A number of strategies exist, but by far the easiest is simply to be honest about what you are there to do and what you are not there to do, and to set expectations at a level that is good but not stellar. Nobody likes to think of themselves as bad at their job, and a consultant who promises superhuman (to them) results is promising to show them up in front of their peers and management.

A good consultant needs to be seen to be worth the money the customer is paying, but being 10% better than average is good; 100% better is frightening.

Expectations set on the first day of the engagement are easy to exceed later, in small gradual steps, giving them time at each stage to get used to the new level. As the old adage says, boiling a frog is best done by slowly raising the temperature.

3. Never forget they’re paying (for) you by the hour (day/week)

It is tempting to read this law as simply saying that they are paying you by the hour, so the slower you perform a task the more hours it takes and the more you earn. Customers are not stupid (although it is often tempting to think otherwise), and they are very wise to this trick: they know when they are spending money and getting little or no value for it, and they will not renew an engagement they see as a money pit.

A slightly less obvious but wiser reading is: they are paying you by the hour, which makes you different from their employees, so different things are expected of a consultant.

3.1 Limits and boundaries to work

The difference between an employee and a consultant is that an employee does whatever is needed, and a consultant does whatever is paid for. That is to say, an employee has a contract that says something like “and other tasks as required”, while a consultant has a contract with a clearly defined and limited scope of works to complete in the agreed number of days (hours, months). An employee can undertake many tasks at once and push out the completion date; the consultant does not have this luxury and must stay focused on the deadline.

To achieve this the consultant must always keep their eyes on the goal of completing the scope of works. Accepting extra tasks, however interesting, useful, or related to the work in hand they might be, is a major risk to that goal. As a rule of thumb, anything that can be dealt with by a 10-minute conversation or email should be, and anything that would take longer should be met with a polite refusal or a suggestion that it be dealt with after the main project has been completed. Occasionally the extra tasks are worth it and the risk can be managed, but failing to complete the scope of works is failing at the engagement, no matter how many other tasks have been achieved.

When deadlines approach and work begins to crunch, there is a strong temptation to work longer and longer hours. For short stretches this can be an unfortunate necessity, but attempts to sustain it over longer periods inevitably lead to consultant burnout and failure. A project that requires constant long days is a project that is badly scoped, and a customer that is getting much more for their money than they should expect. If you are doing 16-hour days then the customer should be paying for double the amount of time, or the project scope should be half what it is. It is not the consultant’s role to sacrifice themselves performing impossible tasks.

3.2 Confidence is the greatest weapon

Companies hire consultants to perform some task they cannot perform themselves, which means they expect the consultant to be an expert in whatever subject they need them to be. Sadly this is almost impossible; consultants are not superheroes. They do, however, have something that provides the next best thing: the confidence and self-assured attitude that convinces the customer they are in safe hands, that everything is going to be OK, and that it will all turn out fine in the end.

Of course beneath the surface absolute panic and terror might be brewing but the calm exterior that the world sees is what makes the reputation.

This can be as simple as how you phrase your comments. 

“I don’t know” is a bad answer, “it’s probably X but let me just check” builds confidence that you are knowledgeable but cautious. 

“I need to follow the documentation” is bad, “let’s follow the docs because missing a step will cause pain later” tells them you’ve been there before and are bringing lessons learnt to their benefit.

With careful phrasing and a willingness to work, competence can be faked. It’s common for a consultant to have a lot of background knowledge but to be only a page ahead of the customer in actual technical knowledge. It may require serious skills in diplomatic bluff to pull this off, but when it works the result is the adrenaline rush of a job well done and a customer that is none the wiser.

3.3 They are not your friends

In the end, the relationship between the consultant and their customer is a professional one, and while building a good, friendly, open relationship will oil the wheels and make the engagement much easier, they are paying for you by the hour.

No one wants to be told how terrible they are; even when they know all the flaws in their organisation, having them pointed out by the hired help is a blow to the ego. This is especially true for managers whose responsibility it is to fix those flaws. The easiest way to deal with this ego-bruising is to get rid of its source by firing the consultant, which is clearly a bad outcome. This goes for all conversations on site, even when you think you are in private: when you are on a customer site you should always assume they can hear you, and treat them respectfully and professionally. The consequences of an unguarded comment are too great to risk.

A classic case of this is the “I bet you’ve never been to a customer as messed up as us” question, where “I think you’re all incompetent idiots” is clearly the wrong response; “no, you’re pretty bad” may not use quite the same words, but incompetence is still implied. A much more diplomatic (and often more truthful) response might be “everywhere does some things well and some things badly”, which subtly says “you’re smart enough to know there are issues, but also smart enough to have fixed some already”.

Conclusion

The three laws of consulting are flippant, and deliberately so; losing your sense of humour is the surest way to let consulting drag you down. But beneath the humour they do provide a framework for what to expect, and for how to approach an engagement to maximise the chances of success.


They boil down to:

  • The problem you’re there to solve
  • The people you have to deal with
  • You, and your relationship with the customer.

Or even further reduced, there are three aspects of consulting:

  • The Problem
  • The People
  • The Consultant

Which, for the less humorous and more mathematically minded, can be summarised as:

problem + people + consultant = engagement

It is certainly possible to have a successful engagement while ignoring one, or sometimes even two, of this triumvirate, but that requires knocking the others out of the park, and that is a much harder proposition leaving little room for problems. The strategy that all consultants, in whatever field, should go into a customer aiming for is to balance the three, and the three laws of consulting provide a good starting point.

A trip into dbus-send

DBus is the interprocess communication mechanism used by the plumbing layer of Linux to allow the various components to use each other's services without each of them needing to implement custom code for every other component. Even for a sysadmin it's a fairly esoteric subject, but it does help explain how another bit of Linux works. However, I've always found exploring it confusing, and I tend to forget the hard-won lessons soon afterwards. This time I decided to post my notes to the internet in the hope that the next time I feel the need to explore the world of DBus I have a place to start.

Firstly, there are two buses: a session bus that is individual to the user's session and used heavily by GNOME and other user software, and the system bus used by the plumbing layer.

Each bus has a number of services registered on it that can be listed with the dbus-send command by querying the DBus service itself.

dbus-send --system --print-reply --dest=org.freedesktop.DBus /org/freedesktop/DBus org.freedesktop.DBus.ListNames

which produces an array of the services:

   array [
      string "org.freedesktop.DBus"
    ...
      string "com.redhat.NewPrinterNotification"
      string "com.redhat.PrinterDriversInstaller"
      string "com.redhat.ifcfgrh1"
      string "fi.w1.wpa_supplicant1"
      string "org.fedoraproject.FirewallD1"
      string "org.freedesktop.Accounts"
      string "org.freedesktop.Avahi"
      string "org.freedesktop.ColorManager"
      string "org.freedesktop.ModemManager1"
      string "org.freedesktop.NetworkManager"
      string "org.freedesktop.PolicyKit1"
      string "org.freedesktop.RealtimeKit1"
      string "org.freedesktop.resolve1"
      string "org.freedesktop.systemd1"
      string "org.gnome.DisplayManager"
   ]

Each service implements at least one object, which has a number of interfaces, which are groups of methods and properties. Lots of different objects can implement the same interface, and so implement the same set of methods and properties; this makes it easy to treat them all in a similar way. Two interfaces that pretty much everything implements are org.freedesktop.DBus.Introspectable, which has a single method, Introspect, that returns a list of all the interfaces, methods, and properties the object implements, and org.freedesktop.DBus.Properties, which provides common methods to get and set all the properties of the object.

A dbus-send command to invoke one of these methods looks like:

dbus-send --system --print-reply --dest=[service] [objectname] [interface].[method] [parameters]

So a call to the org.freedesktop.systemd1 service listed in the output above needs the name of an object. All services implement an object named after their service name, with / replacing the dots, giving /org/freedesktop/systemd1; and, as discussed above, all objects implement the org.freedesktop.DBus.Introspectable interface with its Introspect method, so we can call:

dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1   org.freedesktop.DBus.Introspectable.Introspect

which lists all the methods and properties available for systemd in XML format, e.g.:

<interface name="org.freedesktop.systemd1.Manager">
  <method name="ClearJobs">
  </method>
  <method name="ResetFailed">
  </method>
  <method name="ListUnits">
   <arg type="a(ssssssouso)" direction="out"/>
  </method>
  <method name="ListUnitsFiltered">
   <arg type="as" direction="in"/>
   <arg type="a(ssssssouso)" direction="out"/>
  </method>
        ...
</interface>

If you’ve followed along so far it should then come as no surprise that you can run the ListUnits method and get a list of all the units managed by systemd:

dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1   org.freedesktop.systemd1.Manager.ListUnits 
...
      struct {            
         string "ypbind.service"                                                                                       
         string "ypbind.service"                                                                                       
         string "not-found"                                                                                            
         string "inactive"
         string "dead"                                  
         string ""                                                                                                     
         object path "/org/freedesktop/systemd1/unit/ypbind_2eservice"                                                 
         uint32 0         
         string ""                                                                                                     
         object path "/"                                                                                               
      }         

(The _2e in the path is the ASCII code for a period. The question of why I have ypbind installed on my laptop is for another day, but at least it's not running.)

This tells us that ypbind has its own object on the bus called /org/freedesktop/systemd1/unit/ypbind_2eservice, and indeed

dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/ypbind_2eservice  org.freedesktop.DBus.Introspectable.Introspect

lists its methods and properties, because it too implements the org.freedesktop.DBus.Introspectable interface. The --dest option is still set to org.freedesktop.systemd1 because ypbind_2eservice is a child object of systemd1. But why was this not listed when we queried the org.freedesktop.DBus object?

Objects within a service form a hierarchy, and introspecting any object will only list the direct child objects that belong to it. org.freedesktop.DBus has no child objects of its own, and the ListNames method we used initially lists the top-level services registered on the bus, not all the objects they provide. So how then do we seek out these child objects?

When you introspect an object it not only lists the methods and properties of the object, but it also lists the names of the child objects or nodes that sit directly below it. We can see these by introspecting again:

dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1   org.freedesktop.DBus.Introspectable.Introspect  | grep node
<node>
 <node name="unit"/>
 <node name="job"/>
</node>

We can then introspect these in turn

dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1/unit   org.freedesktop.DBus.Introspectable.Introspect  | grep node
...
 <node name="rpcbind_2eservice"/>                                                                                                                                                                                                             
 <node name="iscsi_2eservice"/>                                                                                                                                                                                                               
 <node name="rpc_2dstatd_2eservice"/>        
 <node name="ypbind_2eservice"/>

and as these are children of /org/freedesktop/systemd1/unit that gives us the /org/freedesktop/systemd1/unit/ypbind_2eservice object we saw above.

While listing objects is fun, we really want to be able to send them messages over the bus, or in other words call their methods.

Pretty much every object on the bus implements the org.freedesktop.DBus.Properties interface, which provides methods to get and set properties. In the introspection output above these methods were described like:

 <interface name="org.freedesktop.DBus.Properties">                                                                    
  <method name="Get">                                                                                                  
   <arg name="interface" direction="in" type="s"/>                                                                     
   <arg name="property" direction="in" type="s"/>                                                                      
   <arg name="value" direction="out" type="v"/>                                                                        
  </method>                                                                                                            
  <method name="GetAll">                                                                                               
   <arg name="interface" direction="in" type="s"/>                                                                     
   <arg name="properties" direction="out" type="a{sv}"/>                                                               
  </method>                                                                                                            
  <method name="Set">                                                                                                  
   <arg name="interface" direction="in" type="s"/>                                                                     
   <arg name="property" direction="in" type="s"/>                                                                      
   <arg name="value" direction="in" type="v"/>                                                                         
  </method> 
...
 <interface name="org.freedesktop.systemd1.Service">
  <property name="MainPID" type="u" access="read">
  </property>

Arguments with direction "in" obviously need to be passed in, and those with direction "out" are the return values. Each parameter needs its data type prefixed to its value (type:value), and parameters are passed on the dbus-send command line in the order specified. So to get the value of the MainPID property we need to pass the interface and the property name, both strings (s):

# dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/docker_2dcontainerd_2eservice org.freedesktop.DBus.Properties.Get string:org.freedesktop.systemd1.Service string:MainPID

method return time=1561025611.053851 sender=:1.5 -> destination=:1.28678 serial=39522 reply_serial=2
   variant       uint32 22193

and we get back a variant (type v), in this case the PID of the docker-containerd process:

# ps -ef | grep 22193

root     22193     1  0 10:02 ?        00:00:00 /usr/libexec/docker/docker-containerd-current --listen unix:///run/containerd.sock --shim /usr/libexec/docker/docker-containerd-shim-current --start-timeout 2m
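
As an aside, the GetAll method on the same Properties interface returns every property of an interface in one call, so something like this should also work for the same unit:

dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/docker_2dcontainerd_2eservice org.freedesktop.DBus.Properties.GetAll string:org.freedesktop.systemd1.Service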

Other methods let us start and stop the unit

  <method name="Start">
   <arg type="s" direction="in"/>
   <arg type="o" direction="out"/>
  </method>
  <method name="Restart">                                                                                              
   <arg type="s" direction="in"/>                                                                                      
   <arg type="o" direction="out"/>                                                                                     
  </method>

These parameters have no names given, but a bit of trial and error (and reading of error messages) suggests that the "in" parameter is the job mode, described in the systemctl man page as controlling "how to deal with already queued jobs", with "replace" as a valid option. So we can restart the service simply by:

dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/docker_2dcontainerd_2eservice  org.freedesktop.systemd1.Unit.Restart string:replace

And we can now inspect the PID of the process again:

# dbus-send --system --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/docker_2dcontainerd_2eservice org.freedesktop.DBus.Properties.Get string:org.freedesktop.systemd1.Service string:MainPID

method return time=1561069031.851603 sender=:1.5 -> destination=:1.34460 serial=47532 reply_serial=2
   variant       uint32 31169

The process number is now reporting as 31169, and when we double check:

# ps -ef | grep 31169
root     31169     1  0 23:14 ?        00:00:00 /usr/libexec/docker/docker-containerd-current --listen unix:///run/containerd.sock --shim /usr/libexec/docker/docker-containerd-shim-current --start-timeout 2m

And we can see that the PID of the docker-containerd service has changed after it was restarted.

And there we have it, how to explore and manipulate the system via dbus.

One final bonus: systemd can also be used to control the power state of the machine. Is this the most verbose reboot command possible?

dbus-send --system --type=method_call --print-reply --dest=org.freedesktop.systemd1 /org/freedesktop/systemd1/unit/reboot_2etarget org.freedesktop.systemd1.Unit.Start string:replace-irreversibly

SSL and the Openshift API

Adventures trying to use the OpenShift and Kubernetes APIs on RHEL, Windows, and from within a pod.

One of the great features of Kubernetes and OpenShift is the API and the ability to perform any action by making the correct REST call, but as a wise man once said, the three great virtues of a programmer are laziness, impatience, and hubris. Making REST calls explicitly and dealing with the results, even in a language like Python, is long-winded and hard work, so obviously libraries have been written that wrap up the OpenShift API to make life easy.

Sadly, writing great documentation is not a virtue that comes naturally to programmers, so the OpenShift Python libraries take a little figuring out, especially when you start hitting weird errors or working on platforms that are not well tested.

Installation is pretty straightforward; a simple

pip install openshift

will install both the openshift Python library and the Kubernetes Python library that it builds upon. Alternatively, on RHEL7 a simple

yum install python2-openshift.noarch

should work. Failing that, the code for the openshift library is part of the official OpenShift release and can be found at https://github.com/openshift/openshift-restclient-python

Once you have it installed, using it is simple:

from kubernetes import client, config
from openshift.dynamic import DynamicClient

# Build a client from the credentials in ~/.kube/config (created by 'oc login')
k8s_client = config.new_client_from_config()
client = DynamicClient(k8s_client)

# List the pods in the 'default' namespace and print their names and IPs
v1_pod = client.resources.get(api_version='v1', kind='Pod')
podList = v1_pod.get(namespace="default")
for pod in podList['items']:
    print("%s %s" % (pod.metadata.name, pod.status.podIP))

which should use the credentials stored in .kube/config (generated by the oc login command) to get all the pods in the cluster's default namespace and list their names and IPs.

Sadly, the difference between theory and practice is that in theory they are the same and in practice they're not. What actually happened when I ran that on my development platform, a slightly left-field mix of Windows 10, Python, and MinGW (don't ask, the customer makes the rules), is that I was greeted with an ugly SSL error message clearly suggesting a certificate problem:

2019-03-19 17:21:30,191 WARNING Retrying 
(Retry(total=2, connect=None, read=None, redirect=None, status=None))
 after connection broken by 'SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED]
certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)'))': /version

There is an issue on GitHub that seems to fit, https://github.com/openshift/openshift-restclient-python/issues/198, but no solution.

An experiment with the pure Kubernetes library had the same result.

Several frustrating hours later, it appears there is something strange in how Python on Windows deals with certificates. Quite what that is would take longer to figure out than I had; certainly longer than it took to find the workaround!

The workaround is pretty simple and obvious: you just turn off SSL verification. It's not exactly great practice and should never be done in production, but the technique opens the way to a more flexible approach to authorisation. This leaves our little demo program looking like:

import os
from kubernetes import client, config
from openshift.dynamic import DynamicClient

token = os.getenv("OCP_TOKEN")

config = client.Configuration()
config.host = "https://example.com:8443/"
config.verify_ssl = False
config.api_key = {"authorization": "Bearer " + token}

k8s_client = client.ApiClient(config)
client = DynamicClient(k8s_client)

v1_pod = client.resources.get(api_version ='v1', kind='Pod')
podList = v1_pod.get(namespace = "default")
for pod in podList['items']:
    print("%s %s"%(pod.metadata.name,pod.status.podIP))

Complete with reading the token from an environment variable which you can set with

export OCP_TOKEN=`oc whoami -t`

And there you are, a simple workaround to an annoying problem.

RHEL

Interestingly, while testing out that bit of code I discovered another annoying SSL issue, this time running on RHEL7, where turning off SSL verification leads to the following error:

urllib3.exceptions.SSLError: 
[Errno 2] No such file or directory

This one comes down to the python2-certifi package, which seems to point to a certificate file that it should provide itself but doesn't, and which is needed even with verification turned off. This looks to be fixed in the python3 version, but python2 is still the default on RHEL7, so that's a little annoying. Luckily, this gives us the chance to make our little test program a lot better, by turning verification back on and pointing it to the system certs provided by the ca-certificates package:

import os
from kubernetes import client, config
from openshift.dynamic import DynamicClient

token = os.getenv("OCP_TOKEN")

k8sconfig = client.Configuration()
config.host = "https://example.com:8443/"
k8sconfig.verify_ssl = True
k8sconfig.ssl_ca_cert = "/etc/pki/tls/certs/ca-bundle.crt"
k8sconfig.api_key = {"authorization": "Bearer " + token}

k8s_client = client.ApiClient(k8sconfig)
client = DynamicClient(k8s_client)
v1_pod = client.resources.get(api_version ='v1', kind='Pod')
mypod = v1_pod.get(namespace = "default")
for pod in mypod['items']:
    print("%s %s"%(pod.metadata.name,pod.status.podIP))

And there we have it, the different variations of the program we need to get it working on different client platforms.

As a final word, the Kubernetes Python library is similar but a little different; the same program would look more like:

import os
from kubernetes import client, config

token = os.getenv("OCP_TOKEN")

k8sconfig = client.Configuration()
k8sconfig.host = "https://example.com:8443"
k8sconfig.verify_ssl = True
k8sconfig.ssl_ca_cert = "/etc/pki/tls/certs/ca-bundle.crt"
k8sconfig.api_key = {"authorization": "Bearer " + token}

k8sclient = client.ApiClient(k8sconfig)
v1 = client.CoreV1Api(k8sclient)
podList = v1.list_namespaced_pod(namespace="default")
for pod in podList.items:
    print("%s"%(pod.metadata.name))

Postscript

Just for fun, if you want to run this from within an OpenShift container there are a couple of changes worth making. There is a service, set up automatically, that allows pods to contact the API server at https://kubernetes.default.svc, and the certs required to talk to it are injected into all pods at run time. So our little test program becomes:

import os
from kubernetes import client, config
from openshift.dynamic import DynamicClient

token = os.getenv("OCP_TOKEN")

k8sconfig = client.Configuration()
k8sconfig.host = "https://kubernetes.default.svc"
k8sconfig.verify_ssl = True
k8sconfig.ssl_ca_cert = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
k8sconfig.api_key = {"authorization": "Bearer " + token}

k8s_client = client.ApiClient(k8sconfig)
client = DynamicClient(k8s_client)
v1_pod = client.resources.get(api_version ='v1', kind='Pod')
mypod = v1_pod.get(namespace = "default")
for pod in mypod['items']:
    print("%s %s"%(pod.metadata.name,pod.status.podIP))

OpenShift Container Platform 4 on Azure using Installer Provisioned Infrastructure

Overview

This post is entirely for fun. I am trying a developer preview product – the OpenShift Container Platform 4 (OCP 4) Installer Provisioned Infrastructure (IPI) on Microsoft Azure.

I really didn’t want the day to end in blood, sweat, and tears, so I went through as much documentation as I could find relating to OCP 4.1 on AWS and Azure, and generally some code. I created a pay as you go account and purchased a domain name. For now let’s call it example.com.

Blood, Sweat, and Tears (or not)

The first thing I created in Azure was a resource group called openshift4-azure. In that resource group I created a public DNS zone for my domain name and delegated its management to the Azure DNS servers. This is to manage the entries that the OCP 4 installer will need to create in order to manage traffic into the cluster.
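
Roughly, those first two steps look like this with the Azure CLI (using the names from this post; the final query just lists the name servers you need for the delegation at the registrar):

az group create --name openshift4-azure --location uksouth
az network dns zone create --resource-group openshift4-azure --name example.com
az network dns zone show --resource-group openshift4-azure --name example.com --query nameServers --output tsv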

I then set up my local golang environment and path (https://golang.org/doc/install), which I needed in order to compile the installer for Azure – the binaries are not readily available yet. To test that this was working I ran env | grep GOPATH; my path is $HOME/go.
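
In practice that boiled down to something like:

export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
mkdir -p $GOPATH/src/github.com/openshift
env | grep GOPATH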

I then forked the openshift/installer repository and cloned it into the go path: $HOME/go/src/github.com/openshift/. I added the upstream remote (git remote add upstream https://github.com/openshift/installer.git) to my fork in case I made code/documentation changes for PRs. To build the binary I ran ./hack/build.sh from the installer directory, which created the installer in the bin folder.
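
The fork/clone/build dance looks roughly like this (the fork URL is a placeholder for your own):

cd $HOME/go/src/github.com/openshift
git clone https://github.com/<your-github-user>/installer.git   # placeholder fork URL
cd installer
git remote add upstream https://github.com/openshift/installer.git
./hack/build.sh   # drops the openshift-install binary into bin/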

I followed the instructions at https://github.com/openshift/installer/tree/master/docs/user/azure/install.md to copy a CoreOS image into my region; the same image needs to be copied into every region where I want to create a cluster. I wanted to run these steps repeatedly, so I downloaded the Azure CLI from https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-yum?view=azure-cli-latest. I’m running this in uksouth, so these are the commands I needed to run:

export VHD_NAME=rhcos-410.8.20190504.0-azure.vhd
az storage account create --location uksouth --name ckocp4storage --kind StorageV2 --resource-group openshift-azure
az storage container create --name vhd --account-name ckocp4storage
az group create --location uksouth --name rhcos_images
ACCOUNT_KEY=$(az storage account keys list --account-name ckocp4storage --resource-group openshift-azure --query "[0].value" -o tsv)
az storage blob copy start --account-name "ckocp4storage" --account-key "$ACCOUNT_KEY" --destination-blob "$VHD_NAME" --destination-container vhd --source-uri "https://openshifttechpreview.blob.core.windows.net/rhcos/$VHD_NAME"

It took me a few tries to create the storage account – the account name needs to be globally unique (it forms part of the blob endpoint hostname), not just unique within a region.
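
The CLI can tell you whether a name is still free before you burn another attempt:

az storage account check-name --name ckocp4storage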

It is recommended to use the Premium_LRS SKU. To get premium storage in Azure on a PAYG account, I needed to register the storage resource provider under PayAsYouGo subscription -> Resource Providers. Before creating the image, the storage blob copy also needs to finish, otherwise you get the following error:

Cannot import source blob https://ckocp4storage.blob.core.windows.net/vhd/rhcos-410.8.20190504.0-azure.vhd since it has not been completely copied yet. Copy status of the blob is CopyPending.

Once the copy has completed, the image can be created from the blob:

export RHCOS_VHD=$(az storage blob url --account-name ckocp4storage -c vhd --name "$VHD_NAME" -o tsv)
az image create --resource-group rhcos_images --name rhcostestimage --os-type Linux --storage-sku Premium_LRS --source "$RHCOS_VHD" --location uksouth
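
To see whether the copy has finished before attempting the image creation, something like this works with the same account and container:

az storage blob show --account-name ckocp4storage --account-key "$ACCOUNT_KEY" --container-name vhd --name "$VHD_NAME" --query properties.copy.status --output tsv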

I created a service principal for my installation and copied the following somewhere safe:

az ad sp create-for-rbac --name openshift4azure
{
  "appId": "serviceprincipal",
  "displayName": "openshift4azure",
  "name": "http://openshift4azure",
  "password": "serviceprincipalpassword",
  "tenant": "tenant id"
}

And gave it the following access:

az role assignment create --assignee serviceprincipal --role "User Access Administrator"
az role assignment create --assignee serviceprincipal --role "Contributor"

I then got my oc client and pull secret as described at https://cloud.redhat.com/openshift/install/azure/user-provisioned.

I tried my first Azure IPI OCP4 install, and the first thing I got was the following:

openshift-install create cluster
? SSH Public Key $HOME/.ssh/id_rsa.pub
? azure subscription id yyyy-xxxx-nnnn-bbbb-fffffff
? azure tenant id yyy-xxxx-nnnn-bbbb-nnnnnn
? azure service principal client id yyy-xxxx-nnnn-bbbb-ccccccccc
? azure service principal client secret [? for help] ************************************
INFO Saving user credentials to "$HOME/.azure/osServicePrincipal.json" 
? Region uksouth
? Base Domain example.com
? Cluster Name attempt1
? Pull Secret [? for help] ********************************
INFO Creating infrastructure resources... 
^CERROR 
ERROR Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation results in exceeding quota limits of Core. Maximum allowed: 10, Current in use: 8, Additional requested: 8. Please read more about quota increase at https://aka.ms/ProdportalCRP/?#create/Microsoft.Support/Parameters/{\"subId\":\"ae90eef6-f8ea-479c-8c6a-9dd4bf9e47d0\",\"pesId\":\"15621\",\"supportTopicId\":\"32447243\"}." 
ERROR 
ERROR on ../../../../../../../../tmp/openshift-install-216822811/bootstrap/main.tf line 117, in resource "azurerm_virtual_machine" "bootstrap": 
ERROR 117: resource "azurerm_virtual_machine" "bootstrap" { 
ERROR 
ERROR 
ERROR 
ERROR Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation results in exceeding quota limits of Core. Maximum allowed: 10, Current in use: 8, Additional requested: 8. Please read more about quota increase at https://aka.ms/ProdportalCRP/?#create/Microsoft.Support/Parameters/{\"subId\":\"ae90eef6-f8ea-479c-8c6a-9dd4bf9e47d0\",\"pesId\":\"15621\",\"supportTopicId\":\"32447243\"}." 
ERROR 
ERROR on ../../../../../../../../tmp/openshift-install-216822811/master/master.tf line 44, in resource "azurerm_virtual_machine" "master": 
ERROR 44: resource "azurerm_virtual_machine" "master" { 
ERROR 
ERROR 
ERROR 
ERROR Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Operation results in exceeding quota limits of Core. Maximum allowed: 10, Current in use: 8, Additional requested: 8. Please read more about quota increase at https://aka.ms/ProdportalCRP/?#create/Microsoft.Support/Parameters/{\"subId\":\"ae90eef6-f8ea-479c-8c6a-9dd4bf9e47d0\",\"pesId\":\"15621\",\"supportTopicId\":\"32447243\"}." 
ERROR 
ERROR on ../../../../../../../../tmp/openshift-install-216822811/master/master.tf line 44, in resource "azurerm_virtual_machine" "master": 
ERROR 44: resource "azurerm_virtual_machine" "master" { 
ERROR 
ERROR 
FATAL failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to apply using Terraform

A standard PAYG account does not allow for the number of resources that IPI creates. It requires more than the 10 cores available by default, so I needed to request a compute quota increase:

Resource Manager, UKSOUTH, DSv2 Series from 10 to 100
Resource Manager, UKSOUTH, DSv3 Series from 10 to 100
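
To see where you currently stand against those limits, something like this helps:

az vm list-usage --location uksouth --output table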

I needed to export an environment variable overriding the RHCOS install image, pointing at the image I created earlier from the storage account blob:

export OPENSHIFT_INSTALL_OS_IMAGE_OVERRIDE="/resourceGroups/rhcos_images/providers/Microsoft.Compute/images/rhcostestimage"
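
That value is the image’s resource ID; if you want to double-check it, az image show prints the full ID (including the /subscriptions/... prefix):

az image show --resource-group rhcos_images --name rhcostestimage --query id --output tsv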

I destroyed the stack (openshift-install destroy cluster --dir=cluster-dir) to try again and watched with glee as my Azure attempt 1 resource group diminished. It was then time for attempt 2, for which I also passed the Azure authentication credentials location in a json file by exporting AZURE_AUTH_LOCATION=creds.json. Baaaad idea. The installer overwrote my credentials file. It’s a good thing I had a copy and didn’t particularly care.

Attempt 2 seems to have worked. I have an operational cluster. All my operators are running in a good state (not degraded and not progressing):

bin]$ ~/bin/oc get co
NAME                                       VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.okd-2019-06-25-110619   True        False         False      46m
cloud-credential                           4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
cluster-autoscaler                         4.2.0-0.okd-2019-06-25-110619   True        False         False      65m
console                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      50m
dns                                        4.2.0-0.okd-2019-06-25-110619   True        False         False      65m
image-registry                             4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
ingress                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      53m
kube-apiserver                             4.2.0-0.okd-2019-06-25-110619   True        False         False      61m
kube-controller-manager                    4.2.0-0.okd-2019-06-25-110619   True        False         False      62m
kube-scheduler                             4.2.0-0.okd-2019-06-25-110619   True        False         False      61m
machine-api                                4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
machine-config                             4.2.0-0.okd-2019-06-25-110619   True        False         False      62m
marketplace                                4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
monitoring                                 4.2.0-0.okd-2019-06-25-110619   True        False         False      52m
network                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
node-tuning                                4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
openshift-apiserver                        4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
openshift-controller-manager               4.2.0-0.okd-2019-06-25-110619   True        False         False      62m
openshift-samples                          4.2.0-0.okd-2019-06-25-110619   True        False         False      53m
operator-lifecycle-manager                 4.2.0-0.okd-2019-06-25-110619   True        False         False      63m
operator-lifecycle-manager-catalog         4.2.0-0.okd-2019-06-25-110619   True        False         False      63m
operator-lifecycle-manager-packageserver   4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
service-ca                                 4.2.0-0.okd-2019-06-25-110619   True        False         False      66m
service-catalog-apiserver                  4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
service-catalog-controller-manager         4.2.0-0.okd-2019-06-25-110619   True        False         False      60m
storage                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      59m
support                                    4.2.0-0.okd-2019-06-25-110619   True        False         False      66m

 

Conclusion

For a first attempt on a developer preview, things went very well. I’ve trawled through the Azure logs and found things like access role issues, so I still don’t know if I’ve made a mistake in my Service Principal allocation. I think some better error handling and messages would help the installer. I’d hate to see things like MachineSets failing to scale because my IAM is wrong without my knowing about it. Of course, general things like installation behind a proxy, bring-your-own DNS or SecurityGroups/Networking, and better publicising of the CoreOS images would also help.

I’m hoping to find out more as I use the cluster over the next few days. If you haven’t yet, try the installer on Azure and let me know what you think:

  1. To get started, visit try.openshift.com and click on “Get Started”.
  2. Log in or create a Red Hat account and follow the instructions for setting up your first cluster on Azure.

PAM – Pluggable Authentication Modules for Linux and how to edit the defaults

Most of us have been using PAM when authenticating without really thinking about it, but for the few of us that have actually tried to make sense of it, PAM is the partner that always says “no”, unless otherwise stated. It’s the bane of any sysadmin’s existence when it comes to making system x secure, and it becomes a major pain point on and off when I forget about the normal rules of engagement.

Rules of Engagement

Session windows

To engage with PAM in any combative situation, please ensure belts and braces are on, and keep your arms and legs inside the vehicle at all times. Backup the /etc/pam.d/ directory, and make sure that you have one or two non-terminating sessions open on your system – ideally a console, and an ssh session.
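
Something as simple as this will do for the backup (the destination path is just an example):

cp -a /etc/pam.d /root/pam.d.backup-$(date +%F)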

The unexpected overwrite

On RHEL/Fedora-like systems, PAM is configured around two main files which are included by the rest of the PAM configuration: /etc/pam.d/system-auth-ac and /etc/pam.d/password-auth-ac. Normally, system-auth and password-auth in the same /etc/pam.d directory are symlinks to those files. The authconfig tools will overwrite the configuration in any file with the -ac suffix, so if your changes need to persist, point the symlinks at a local copy instead, as follows:

ls -l /etc/pam.d/*-auth

lrwxrwxrwx. 1 root root 19 Feb 19 12:57 /etc/pam.d/fingerprint-auth -> fingerprint-auth-ac
lrwxrwxrwx. 1 root root 16 Feb 19 12:57 /etc/pam.d/password-auth -> password-auth-ac
lrwxrwxrwx. 1 root root 17 Feb 19 12:57 /etc/pam.d/smartcard-auth -> smartcard-auth-ac
lrwxrwxrwx. 1 root root 14 Feb 19 12:57 /etc/pam.d/system-auth -> system-auth-ac
[root@node1 pam.d]# rm password-auth
rm: remove symbolic link ‘password-auth’? y
[root@node1 pam.d]# ln -s password-auth-local password-auth
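
For that symlink to resolve, password-auth-local has to exist first – presumably as a copy of the -ac file that you then maintain by hand, something like:

cp -p /etc/pam.d/password-auth-ac /etc/pam.d/password-auth-local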

The Language

When the usual PAM translator (authconfig) is not enough to achieve the right system authentication, one has to start thinking about communicating directly with it. PAM has four different keywords for controlling authentication with the system:

awk '!/^[#\-]/{print $1}' /etc/pam.d/*  | sort | uniq

account
auth
password
session

auth is used for providing some kind of challenge/response (depending on the module) – usually username/password.

account is used to time or otherwise restrict the user account – e.g. the user must pass the faillock checks, satisfy the sssd account requirements, etc.

password is used to update the authtoken associated with the user account. This is mainly used to change passwords, and it is where the rules around local password strength can be formulated.

session is used to determine what the user needs before they are allowed a session: a working home directory, the system limits that apply (open file handles, number of terminals, etc.), and the user keyring.

Each of these keywords has these common control modes (the bracketed form underneath each is its long-hand equivalent):

required - if this fails return failure but continue executing anyway
[success=ok new_authtok_reqd=ok ignore=ignore default=bad]

requisite - if this fails return failure and die (don't even attempt to preauth)
[success=ok new_authtok_reqd=ok ignore=ignore default=die]

sufficient - this is enough for success and exit if nothing previously has failed
[success=done new_authtok_reqd=done default=ignore]

optional - we don't care unless this is the only module in the stack associated with this type
[success=ok new_authtok_reqd=ok default=ignore]

PAM execution stack

PAM executes everything sequentially unless told otherwise. The following snippet of password-auth-ac behaves as described by the inline comments:

# set environment variables
auth required pam_env.so
 
# delay (ms) by this amount if last time this user failed
# otherwise if absent check login.defs for specified delay
auth required pam_faildelay.so delay=2000000
 
# check user authentication from the system 
# try_first_pass - use the password that has already been entered if any
# nullok - allow blank passwords
auth sufficient pam_unix.so nullok try_first_pass
 
# require a uid of 1000 or higher (requisite: fail immediately otherwise) and don't log success to the system log
auth requisite pam_succeed_if.so uid >= 1000 quiet_success
 
# deny everything else
auth required pam_deny.so

If anything is sufficient and it succeeds, the execution stack exits for that component – e.g. auth is successful as soon as a local user signs in with the right username/password.

PAM if statements

PAM doesn’t have very obvious if statements, but given the right parameters it allows jumps of execution. Below is a way of incorporating an SSSD back-end with PAM to allow users with IdM logins access to the system:

# check if the user is allowed to log in with preauthorisation (i.e. has faillock entries)
auth        required      pam_faillock.so preauth silent audit deny=5 unlock_time=900
 
# skip the next two rules on success
# NOTE: default=ignore means a failure here is simply ignored and execution continues
# check if it's a unix user - use the password provided by the auth stack
auth        [success=2 new_authtok_reqd=done default=ignore] pam_unix.so try_first_pass
 
# if it's not a unix user, then use sssd backend for logging in
auth        sufficient   pam_sss.so forward_pass
 
# otherwise fail 
auth        [default=die] pam_faillock.so authfail audit deny=5 unlock_time=900
 
# this is where the success=2 jump from pam_unix lands
# it resets the faillock counter after a successful authentication
auth        sufficient    pam_faillock.so authsucc audit deny=5 unlock_time=900

The comments in line explain what each module is doing. The execution sequence reminds me a bit of jump statements in assembly language and it helps me to think about them in that manner.

Putting them all together gives us this auth section:

auth        required      pam_env.so
auth        required      pam_faildelay.so delay=2000000
auth        required      pam_faillock.so preauth silent audit deny=5 unlock_time=900
auth        [success=2 new_authtok_reqd=done default=ignore] pam_unix.so try_first_pass 
auth        sufficient pam_sss.so forward_pass
auth        [default=die] pam_faillock.so authfail audit deny=5 unlock_time=900 
auth        sufficient pam_faillock.so authsucc audit deny=5 unlock_time=900
auth        required      pam_deny.so

If we were to make two slight adjustments to the above snippet, it may have the frightening effect of allowing users to log in without the correct password. A failed pam_unix is ignored (default=ignore), a failed pam_sss is also ignored because it is merely sufficient, and execution then falls through to pam_faillock authsucc – which, swapped above the authfail line and marked sufficient, succeeds for any account that isn’t locked out:

auth        required      pam_env.so
auth        required      pam_faildelay.so delay=2000000
auth        required      pam_faillock.so preauth silent audit deny=5 unlock_time=900
# reducing this number from 2 to 1 (success=1)
auth        [success=1 new_authtok_reqd=done default=ignore] pam_unix.so try_first_pass
auth        sufficient pam_sss.so forward_pass
# swapping these two lines
auth        sufficient pam_faillock.so authsucc audit deny=5 unlock_time=900
auth        [default=die] pam_faillock.so authfail audit deny=5 unlock_time=900
auth        required      pam_deny.so

Conclusion

PAM is a very powerful, yet quite obscure tool. It can be configured to allow people in without even a valid password, or it can deny everyone access apart from every alternate Tuesday between 19:00 and 20:00 (in combination with other tools). Whenever I have configured it, I have found it useful to test for access allowed, access denied, and access locked in order to ensure predictable operation.

NOTE: To get more information around PAM, visit the man pages: pam.conf(5), pam(8), password-auth(5).

Identity Management (Continued) – The AD integration

Thoughts about AD integration with IPA

Have you ever spent days wondering what would happen if the Windows guys talked to the Linux guys and vice versa? It’s odd; organizations decide to manage identity with Active Directory, and the Linux server estates come in as an afterthought, an add-on that needs to be integrated – and maintained – alongside the Windows boxes.

Centralised authentication is one of those integration points that someone like me has to address. A preexisting Active Directory is usually (and regrettably) the cornerstone for this concept, as it provides authentication to the desktop estate where end users usually remain, for better or for worse. Introducing Linux central authentication to servers and to users is something wonderful. Something to be celebrated by a party! No longer would you manage individual users and passwords, no longer would you have to go through individual servers deleting previous sysadmins that have pained and upset you by leaving you in this mess and no longer would you have the arduous task of clearing up after them. You can remove them from your centrally managed Linux authentication once and for all.

In fact, you can do one better than that. If your company is managing users via AD, and you are integrating AD with IPA you have the possibility to offload this task to the Windows team. They can go and disable the user from AD. All you have to do is sit back and relax.

The Integration Choice

I know my thoughts wandered at this point. All the different possibilities. How best to integrate, what are the benefits of the different options, why would people care. Decision time!

Assuming that you already decided on using IPA and have one up and running and ready to go, you will want to consider the following two possibilities:

  • the user sync from AD to IPA
  • the cross realm trust

Synchronizing users from AD into IPA may leave you with username collisions if you have similarly named users. It also doesn’t give you SSO or unique identities, but it does give you the option of having two different passwords in the Windows and Linux realms. Some may consider that a plus. On the downside, you still need to manage group membership yourself, because the users are synchronized but the AD groups are not.

I prefer the cross realm trust route. It means that user management falls completely under the S.E.P. field. All I need to care about is which groups of users the company trusts enough not to harm themselves and others when using Linux. Each realm is responsible for authenticating its own users, and the groups those users belong to determine what they are authorized to do.

NOTE: If you don’t have an IPA server ready to go, you have very limited time, you’re not allowed to run two different realms, or you don’t care about policy enforcement, consider integrating directly with AD. SSSD and kerberos will let your Linux boxes talk directly to the AD forest and manage its users.
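
On RHEL 7 the realmd tooling does most of that heavy lifting; a rough sketch, assuming the windows.local domain used later in this post:

yum -y install realmd sssd adcli oddjob oddjob-mkhomedir samba-common-tools
realm discover windows.local
realm join --user=Administrator windows.local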

The rocky road

I already have an installation that I can use for my integration. The previous article showed how to disable most IPA features; this part shows how to live with your choices, on IPA 4.1 and AD 2012R2.
IPA needs a NetBIOS name, and the samba and winbind services configured:

ipa-adtrust-install --netbios-name=MGMT

Enable the following ports:

 firewall-cmd --permanent --add-port={138/tcp,139/tcp,445/tcp,138/udp,139/udp,389/udp,445/udp}
 firewall-cmd --reload

Add the service records to DNS. This is one of the things that allows AD to talk to IPA on a native level:

_ldap._tcp.dc._msdcs IN SRV 0 100 389 idm
_kerberos._tcp.dc._msdcs IN SRV 0 100 88 idm
_kerberos._udp.dc._msdcs IN SRV 0 100 88 idm
_ldap._tcp.Default-First-Site-Name._sites.dc._msdcs IN SRV 0 100 389 idm
_kerberos._tcp.Default-First-Site-Name._sites.dc._msdcs IN SRV 0 100 88 idm
_kerberos._udp.Default-First-Site-Name._sites.dc._msdcs IN SRV 0 100 88 idm

And create the trust (Yes, in IPA 4.1 you do have to stop the firewall and then add the trust, there are ports that aren’t mentioned in the docs):

systemctl stop firewalld
ipa trust-add --type=ad "windows.local" --admin Administrator --password
systemctl start firewalld
ipactl status
reboot

Check that the trust exists by finding the trusted domain:

[root@idm ~]# ipa trust-find windows.local
---------------
1 trust matched
---------------
  Realm name: windows.local
  Domain NetBIOS name: WINDOWS
  Domain Security Identifier: S-1-5-21-4218785893-350090421-2374357632
  Trust type: Active Directory domain
----------------------------
Number of entries returned 1
----------------------------

OK, now one kerberos realm trusts the other and this is how it should look:

(Diagram: Cross Realm Trust)

The group mapping

To enable your AD users to log on to clients of IPA, you need to tell IPA how to find them. To do that you will need to add the AD group where your user is as an external group in IPA. Let’s assume you want to allow the AD Domain Admins group to ssh on to servers.

ipa group-add --desc='AD Domain admins' domain_admins_external --external
ipa group-add-member domain_admins_external --external 'WINDOWS\Domain Admins'

These groups won’t have any POSIX attributes as they are external groups. Every user that connects to a Linux computer requires a Linux UID and GID. This means that you will need to create another group to contain the mapped external one:

ipa group-add --desc='AD Posix Domain Admins' domain_admins
ipa group-add-member domain_admins --groups domain_admins_external

Great! You can now start configuring your sudo, host based access control and Identity Views!
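
As a flavour of what that looks like, a host-based access control rule letting the mapped admins group ssh anywhere might be something like the following (the rule name is made up):

ipa hbacrule-add allow_domain_admins_ssh --hostcat=all
ipa hbacrule-add-user allow_domain_admins_ssh --groups=domain_admins
ipa hbacrule-add-service allow_domain_admins_ssh --hbacsvcs=sshd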

Identity Management (IPA) – The ‘on the side’ installation

Thoughts about IPA installation

IPA or IdM in its Red Hat productised form is a very neat product. It allows centralised authentication and policy management while providing that over secure channels (kerberos and TLS). IdM provides quite a few features and you may decide that you’re better off without some (saving the extra calories/effort for later) as your infrastructure may already provide those on the side.

This example installation is without DNS, without a CA, and without NTP (VM installations shouldn’t really be running NTP anyway).

Once you’re past the stage of convincing management that it’ll be good for you (and good for them) to allow this installation to happen, this is what you need to think about and discuss with the team managing Certificate Authorities, NTP servers, and DNS:

  • DNS – DNS zones need to be configured in such a way that IPA acts as the KDC for its own group of servers; if there are existing KDCs in a different realm in the environment, they will need to be in a different subdomain/domain. The SRV records will then only return the IPA servers when queried about kerberos in this subdomain.
  • Certificates – IPA uses SSL for LDAP and HTTP. IPA could act as a Certificate Authority, but not in this instance. Active Directory (or something else) may already be configured as a Certificate Authority, which would allow you to present your Windows team with a certificate request from IPA to sign in order to obtain a valid web certificate.
  • Time – a uniform time source across the estate IPA servers and clients. Think about business meetings, SSL, sex, humour, and trains – all require good timing.

NOTE: IPA/IdM used to provide certificates to its clients by default on installation. As this is no longer the case, IPA can be installed without a CA more easily than you might be used to. Give it a try.

Prerequisite Checking for IPA installation

NOTE: The following installation is for IPA version 4.1 and AD version 2012R2.

Check that you have the following (a quick way to verify most of these is sketched just after the list):

  • Access to the right software packages via yum (normal RHEL/CentOS base repo should do)
  • Forward and reverse resolvable hostname
  • An entry in /etc/hosts with the IP address and hostname
  • nscd off
  • An up-to-date OS installation
  • Still got your sanity (test to be performed by an external third party)
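
A quick way to tick off most of that list (the IP address below is a placeholder for your server’s own):

hostname -f                    # should print the FQDN
getent hosts $(hostname -f)    # forward resolution / /etc/hosts entry
dig +short -x 192.0.2.10       # reverse resolution (placeholder IP)
systemctl is-enabled nscd      # ideally disabled or not installed
yum repolist                   # base repos reachable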

The certificate creation

Create your secret private key for your server. Here, we are using openssl to generate a private key:

mkdir /root/certs 
openssl genrsa -out /root/certs/http.$(hostname).key 2048

And then create the certificate request:

[root@idm certs]# openssl req \
-key /root/certs/http.idm.mgmt.linux.local.key \
-out /root/certs/$(hostname -f).csr -new

You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [XX]:GB
State or Province Name (full name) []:Norfolk
Locality Name (eg, city) [Default City]:Norwich
Organization Name (eg, company) [Default Company Ltd]:MGMT.LINUX.LOCAL
Organizational Unit Name (eg, section) []:She ITs And Giggles
Common Name (eg, your name or your server's hostname) []:idm.mgmt.linux.local  
Email Address []:root@localhost 

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:SheITsAndGiggles
An optional company name []:

Your AD Certificate Authority should be able to sign this CSR; retrieve the CA chain along with the signed certificate for the IPA installation. An example of how to sign certificates with a Windows CA can be found here: Generate a Digital Certificate from CSR.

To examine the Certificates, open them as follows:

openssl x509 -in <certificate> -text

The Installation

The following step just downloads the software on the server. It doesn’t start any services:

yum -y install ipa-server ipa-server-trust-ad

Now for the fun part:

ipa-server-install --http-cert-file /root/certs/http.idm.mgmt.linux.local.crt \
--http-cert-file /root/certs/http.idm.mgmt.linux.local.key \
--http-pin SheITsAndGiggles \
--dirsrv-cert-file /root/certs/http.idm.mgmt.linux.local.crt \
--dirsrv-cert-file /root/certs/http.idm.mgmt.linux.local.key \
--dirsrv-pin SheITsAndGiggles --ca-cert-file ca-chain.p7b \
 -n mgmt.linux.local -r LINUX.LOCAL --mkhomedir

NOTE: I have used the same certificate and key for the http and directory servers. The p7b file that has been downloaded from the CA is the chain.

After the installation, you will need to open all the ports for the services that we are running and add some DNS entries to advertise those services:

systemctl enable firewalld
firewall-cmd --permanent --zone=public \
--add-port={80/tcp,443/tcp,389/tcp,636/tcp,88/tcp,464/tcp,88/udp,464/udp}
firewall-cmd --reload

To tell your IPA clients what you are serving you need to advertise the services via DNS. Find an example below:

_ldap._tcp IN SRV 0 100 389 idm
_ldap._udp IN SRV 0 100 389 idm
_kerberos IN TXT MGMT.LINUX.LOCAL
_kerberos._tcp IN SRV 0 100 88 idm
_kerberos._udp IN SRV 0 100 88 idm
_kerberos-master._tcp IN SRV 0 100 88 idm
_kerberos-master._udp IN SRV 0 100 88 idm
_kpasswd._tcp IN SRV 0 100 464 idm
_kpasswd._udp IN SRV 0 100 464 idm

Check that the installation is running as it should by getting kerberos credentials for your admin user and using them to ssh to the IPA server as admin:

[root@idm ~]# kinit admin
[root@idm ~]# ssh  admin@$(hostname -f)
Creating home directory for admin.