Many of you are probably familiar with the moment in Life of Brian when John Cleese's character asks, “What have the Romans ever done for us?”, only for his fellow revolutionaries to list all the improvements the Roman Empire has brought to Judea. I mention this because, one day recently, I was scrolling through Twitter when I stumbled onto a treasure trove of users blaming Kubernetes for its complexity, claiming things were better before it existed. So, in the interest of fair play, I’ve decided to write a bit about the glorious past and our mediocre present, and to compare the two.
Engineers enjoy identifying problems in architecture and blaming technology for overcomplicating things. Yes, we absolutely do; don’t bother denying it. Only, sometimes we forget about the problems these tools were adopted to solve, and why it’s nonetheless worth the effort to invest in new platforms.
This article is mostly about the (bad) experience of managing infrastructure and doing DevOps/SRE work at several companies. Of course, these anecdotes may not be applicable to today’s infrastructure, but I’m sure lots of you have similar memories, long repressed, that this blog post will help reawaken.
The Aqueduct 🔗︎
Every system starts with basic infrastructure. You need to know where and what kind of machines you have. It does not matter whether this infrastructure is physical, virtual, or a laptop on your desk (I’ve seen more production services than you’d believe that were running off a notebook). You need other pieces of basic information about this infrastructure, and I don’t mean monitoring, which comes later. Just the simple things: where are the machines? Are they operational? What kind of processes are they running? This may seem trivial, but before we had a platform, we maintained countless custom-made tools that did this exact thing.
After reaching a certain number of machines in your infrastructure, you probably started using infrastructure automation tools like Ansible, Salt, and Chef. Each of these had its own way of operating. They kept inventories up-to-date and provided an interface to interact with the machines without manually SSHing into them. You may even have written (at least we did) custom integrations to scale your workload in the cloud or check the integrity of your systems. After a while, some of those machines got unpinned from that kind of automation (this is roughly the point where the “cattle, not pets” service model started gaining popularity). They had a fixed version of the software. They became so specialized and delicate that, after a certain amount of time had passed, no one dared to upgrade or reboot them (been there, seen that).
In Kubernetes, everything is constantly reminding you of how disposable all your resources are. You never have to anchor yourself to a machine just because it’s functional. You use them for raw power, nothing more. And checking inventory is an option that’s always at your fingertips:
$ kubectl get nodes
NAME                                       STATUS   ROLES    AGE   VERSION
ip-10-0-0-103.us-east-2.compute.internal   Ready    <none>   8d    v1.17.7-eks-bffbac
ip-10-0-1-225.us-east-2.compute.internal   Ready    <none>   8d    v1.17.7-eks-bffbac
ip-10-0-2-145.us-east-2.compute.internal   Ready    <none>   8d    v1.17.7-eks-bffbac
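And when a node really is just raw power, getting rid of one is equally routine. As a quick sketch (borrowing one of the node names from the listing above), you can cordon a node so nothing new is scheduled onto it, drain the workloads off it, and remove it from the cluster:
$ kubectl cordon ip-10-0-0-103.us-east-2.compute.internal
$ kubectl drain ip-10-0-0-103.us-east-2.compute.internal --ignore-daemonsets
$ kubectl delete node ip-10-0-0-103.us-east-2.compute.internal
Try doing that to a hand-tuned machine no one has dared to reboot in three years.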
Roman medicine 🔗︎
This might seem trivial, but I remember, back before the containerized world, when all the pieces had to be set in perfect alignment for an application to run properly. Different SDKs, custom-built libraries: all were part of a seeming culture-wide revolt against seamless upgrades. We had to spend a serious amount of time on careful planning and work through a variety of sandbox environments to cover the checklist of potential failure scenarios (and we still missed some). Now compare this with building and testing a container and, afterward, using just one command to update it. That alone is worth the time it takes to invest in Kubernetes. The concepts of Deployment and ReplicaSet were born out of a need to handle upgrades and rollbacks in a standardized way. The cherry on top of all this is that Kubernetes provides a REST API for all of these resources.
$ kubectl set image deployment nginx nginx=nginx:1.9.1
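The same machinery covers the rollback half of the story. These are standard kubectl subcommands, shown here against the same nginx Deployment: the first waits for the rollout to finish, the second reverts to the previous revision if the new image misbehaves.
$ kubectl rollout status deployment nginx
$ kubectl rollout undo deployment nginx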
And the roads! (well, yeah, obviously the roads; the roads go without saying, don’t they?) 🔗︎
Another longstanding point of contention was service discovery and DNS. On bare-metal infrastructure, we had custom DHCP servers and DNS instances (bind, dnsmasq, unbound, and even some custom ones) to manage this simple yet complex feature. We managed public records, internal records, and custom DNS responses based on application metrics, and, yes, sometimes we spent the night debugging what went wrong. Kubernetes’ built-in DNS and the Service concept solved most of these problems. We have limited but well-defined (if you consider round-robin deterministic) resolution rules. Moreover, changing the underlying network driver is now trivial: you can choose any provider that supports the CNI interface.
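To make that concrete, here is a minimal sketch of a Service (the name, labels, and ports are made up for illustration). Any pod in the same namespace can reach it simply as analytics-api, and the cluster DNS also resolves the longer form analytics-api.<namespace>.svc.cluster.local; no bind zone files involved.
apiVersion: v1
kind: Service
metadata:
  name: analytics-api        # hypothetical name, for illustration only
spec:
  selector:
    app: analytics-api       # traffic is routed to pods carrying this label
  ports:
  - port: 80                 # port the Service exposes
    targetPort: 8080         # port the backing pods actually listen on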
Even better, overlay networks and network policies simplify the network stack for developers. You don’t have to deal with VLANs, encapsulation, or custom routing rules (unless you’re the one operating Kubernetes itself). The default behavior is that everything is connected. If you need to separate things out, you only have to define network policies (a sketch follows below), and there you go. Easy to understand and even easier to audit.
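As a rough illustration of what “defining a network policy” looks like in practice (the labels here are hypothetical and reuse the ones from the Service sketch above), the following only admits ingress traffic from pods labelled app: frontend into pods labelled app: analytics-api:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-only
spec:
  podSelector:
    matchLabels:
      app: analytics-api     # the pods being protected
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend      # only these pods may connect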
Public order 🔗︎
Jobs and cron jobs are another pitfall of legacy infrastructure. Usually, there is a node (or a couple of nodes) that runs the jobs. They are usually connected to all kinds of networks (maybe not initially, but definitely in the long run). They run garbage collection on temporary resources, take backups, or do all kinds of other stuff. In a rapidly evolving environment, it’s hard to track what’s running, and where. In my experience, if you need something fast, you don’t bother to automate it. This is not a problem, in the end, since our job is to keep the services online. Yet sometimes an important job stays in the grey zone. Once we had a case where we were debugging missing analytics data; a job was supposed to have transported data from A to B. We scanned all our automation and found no evidence that such a process existed. We started to wonder how this thing had worked for as long as it had. Checking the timeline, we found that a seemingly irrelevant machine had been rebooted right when the data started drying up. Further investigation revealed that a shell script running in a screen session was responsible for performing the task.
Kubernetes has a standard way of handling these problems. It does not tie jobs to nodes, and you can list, create, and delete jobs like any other resource: no more shady scripts running inside a screen session.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"      # run every minute
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: awesome-job  # container names must be lowercase RFC 1123
            image: busybox
            args:
            - /bin/sh
            - -c
            - do the thing
          restartPolicy: OnFailure
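And once the CronJob is in place, inspecting or removing it works like any other resource; these are plain kubectl commands, nothing custom:
$ kubectl get cronjobs
$ kubectl get jobs --watch
$ kubectl delete cronjob hello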
And the wine! (Yeah. Yeah! That’s something we’d really miss if the Romans left.) 🔗︎
There are tons of stories like these. Everybody has a favorite or most hated aspect of this technology. We could go on and on with Kubernetes features like quotas, security policies, etc., but I don’t want to write a Kubernetes feature blog. I just wanted to gather some old stories that would have turned out differently if we had been using Kubernetes. And remember! The thing you have to ask yourself is, apart from the sanitation, the medicine, education, wine, public order, irrigation, roads, the freshwater system and public health, what have the Romans ever done for us?