Upgrading Kubernetes: 8 years of production

Author: Matt Burgess | Posted on: November 8, 2023

How we successfully keep multi-tenanted production clusters up to date.

CECG engineers have been running production Kubernetes clusters for one of our clients for over 8 years now. For context, that’s roughly the time of the Kubernetes-v1.0.0 release and, at the time of writing, Kubernetes-v1.28.0 is the latest GA version!

Those production clusters support several thousand developers split across several hundred development teams who are all aiming to get their latest builds out to users.

To take advantage of new features and bug fixes, we must regularly upgrade the version of Kubernetes that those clusters run. Upstream Kubernetes and the various Managed Kubernetes services (EKS, GKE, AKS) have specific support lifecycles, and we need to stay within them to remain eligible for support.

This article details our experiences of trying to achieve the above, issues we’ve hit along the way, and some suggested timelines and approaches you may want to consider for your own clusters.

Tradeoff: Kubernetes Feature Availability vs. Upgrade Cadence

When a new version of Kubernetes deprecates or removes a particular API, all consumers of that API need to be updated. An example was in v1.22, where a lot of long-deprecated APIs were removed, including extensions/v1beta1 and networking.k8s.io/v1beta1 versions of ‘Ingress’.
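To give a feel for what "updating a consumer" involves, here is a minimal sketch of moving an Ingress manifest from a v1beta1 API version to networking.k8s.io/v1. It assumes the manifest has already been loaded into a Python dict, and the function names (`migrate_ingress`, `_convert_backend`) are hypothetical. The field changes reflect our reading of the upstream migration guide (renamed backend fields, mandatory `pathType`); the default `pathType` chosen here is an assumption you should review per application.

```python
import copy


def _convert_backend(backend: dict) -> dict:
    """Convert a v1beta1 service backend to the v1 shape."""
    if "serviceName" not in backend:
        # Not a service backend (e.g. a resource backend); leave untouched.
        return backend
    port = backend["servicePort"]
    # In v1, numeric ports live under `number` and named ports under `name`.
    port_field = {"number": port} if isinstance(port, int) else {"name": port}
    return {"service": {"name": backend["serviceName"], "port": port_field}}


def migrate_ingress(manifest: dict) -> dict:
    """Return a networking.k8s.io/v1 copy of a v1beta1 Ingress manifest."""
    migrated = copy.deepcopy(manifest)  # never mutate the caller's manifest
    migrated["apiVersion"] = "networking.k8s.io/v1"
    spec = migrated.get("spec", {})
    # `spec.backend` is called `spec.defaultBackend` in v1.
    if "backend" in spec:
        spec["defaultBackend"] = _convert_backend(spec.pop("backend"))
    for rule in spec.get("rules", []):
        for path in rule.get("http", {}).get("paths", []):
            path["backend"] = _convert_backend(path["backend"])
            # `pathType` is required in v1; this default is an assumption.
            path.setdefault("pathType", "ImplementationSpecific")
    return migrated


old = {
    "apiVersion": "extensions/v1beta1",
    "kind": "Ingress",
    "metadata": {"name": "web"},
    "spec": {
        "rules": [
            {
                "host": "example.com",
                "http": {
                    "paths": [
                        {"path": "/", "backend": {"serviceName": "web", "servicePort": 80}}
                    ]
                },
            }
        ]
    },
}
new = migrate_ingress(old)
print(new["apiVersion"])
print(new["spec"]["rules"][0]["http"]["paths"][0])
```

Even this small example touches three separate fields, which is why a removal like this needs plenty of notice when many teams own their own manifests.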

In order to help development teams onboard onto our clusters as quickly and easily as possible, we have developed some custom deployment tooling so that they don’t have to write Kubernetes manifests directly. In addition, it ensures that teams comply with certain policies (security requirements, PodDisruptionBudgets, etc.) by default, all of which help us to keep our clusters performant and running smoothly. On the face of it then, this would seem like an ideal scenario; we have a single consumer of the Kubernetes APIs and we, as authors of that deployment tool, can update it as and when those APIs are deprecated and removed. However, there are two important caveats:

Tenants need to upgrade the version of the deployment tool to pick up those API changes. Admittedly, this should be a much easier request to see completed than having to update potentially many manifests, but we still ultimately have a dependency on an external team.
Tenants aren’t obligated to use the deployment tool. One of our primary design philosophies is that we should strive to give our tenants the ability to utilise any Kubernetes feature they wish to get their job done. Within reason, of course! If we were to make the deployment tool the only way in which tenants interact with the clusters, they would be unable to experiment with new features until we were able to make time to update the tooling. Additionally, tenants may need to write their own manifests as an ‘escape hatch’ in case there are specific reasons that the default security or PDB policies, for example, aren’t suitable for their particular application.

On balance, we believe that making new features available to tenants outweighs our dependency on them to facilitate migration to new Kubernetes APIs.

Until relatively recently, we had very limited visibility into how widely deprecated and to-be-removed APIs were being used. Therefore, it was hard to gauge the impact that any particular upgrade was going to have. We have since adopted kube-no-trouble, which analyses your clusters and shows any resources using APIs that are deprecated or will be removed in future Kubernetes versions. Whilst out of the box it only provides text output suitable for ad-hoc runs, we integrated it with Prometheus using kube-no-trouble#302 almost verbatim. This has given us visibility of upgrade-blocking issues at a per-team and per-namespace level, and lets us provide tenant-facing dashboards so that tenants can easily monitor their compliance status.

Other Upgrade Considerations

Upgrades to Kubernetes may require changes to other components which necessitate pre-upgrade work. For example, 1.24 required a move away from the dockershim container runtime to an OCI-compliant runtime, e.g. containerd. Similarly, 1.27 requires the installation of a CSI storage driver when the equivalent in-tree storage plugin is removed. Even though these examples are unlikely to require coordination with tenants, such preparatory work can take significant effort. Hence, they need to be analysed and planned in, with plenty of lead time for the actual upgrade.

Upstream Kubernetes Support Schedule

Upstream Kubernetes releases receive approximately 12 months of support. Full details are available in the project's published support schedule.

However, if you are running upstream Kubernetes releases yourself (i.e. you are operating “turnkey” clusters), you will also need to take into account Kubernetes’ version skew policy. This dictates how far the versions of the control plane components may differ from the versions of the components running on the Kubernetes nodes.

In short:

API servers: These must be within 1 minor version of each other in an HA setup; this is to support upgrades. For example, you can do a rolling upgrade that leaves API servers on a mix of 1.28 and 1.27, but a mix of 1.28 and 1.26 is not allowed.
Nodes: From 1.25 onwards, nodes must be within 3 minor versions of kube-apiserver (and never newer than it). For example, if kube-apiserver is at 1.28, nodes are permitted to run 1.28, 1.27, 1.26, or 1.25. Prior to 1.25, nodes must be within 2 minor versions of kube-apiserver.

As you can see from the above, the skew policy means that nodes do not have to be upgraded on every release. Whilst it is recommended to keep both nodes and API servers up to date, there may be circumstances that dictate otherwise. For example, if your clusters are sufficiently far behind the latest release, you may choose to skip node upgrades in order to ‘catch up’ more quickly to supported versions.
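The skew rules summarised above are easy to encode as a pre-upgrade check. The sketch below is one way of doing so, under the following assumptions: all versions share major version 1, the 1.25 threshold applies to the version the node's kubelet is running, and a node may never be newer than kube-apiserver. The function names are hypothetical.

```python
def minor(version: str) -> int:
    """Extract the minor number from a version such as '1.28' or 'v1.28.3'."""
    return int(version.lstrip("v").split(".")[1])


def api_servers_ok(versions: list[str]) -> bool:
    """API servers in an HA setup must be within 1 minor version of each other."""
    minors = [minor(v) for v in versions]
    return max(minors) - min(minors) <= 1


def node_ok(api_server: str, kubelet: str) -> bool:
    """Nodes may lag kube-apiserver by a bounded number of minor versions."""
    skew = minor(api_server) - minor(kubelet)
    # Assumption: the wider allowance applies to kubelets at 1.25 or later.
    allowed = 3 if minor(kubelet) >= 25 else 2
    # A negative skew would mean the node is newer than the API server.
    return 0 <= skew <= allowed


print(api_servers_ok(["1.28", "1.27"]))  # True
print(api_servers_ok(["1.28", "1.26"]))  # False
print(node_ok("1.28", "1.25"))  # True
print(node_ok("1.28", "1.24"))  # False
```

A check like this can help when planning a 'catch up': it shows how many control plane upgrades you can perform before the nodes must follow.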

Vendor Support Schedules

The table below consolidates data taken from the EKS Kubernetes Release Calendar and GKE Release Schedule at the time of writing. Its inclusion here is merely illustrative of the general support policy. For simplicity, we only consider the GKE ‘Regular’ release channel dates as that channel is recommended for most customers and its release/support dates more closely align with EKS for the most recent releases.

Kubernetes version	Upstream release	AWS EKS release	GCP GKE (Regular) release	Upstream end of support	AWS EKS end of support	GCP GKE end of support
1.27	April 11, 2023	May 24, 2023	June 14, 2023	June 28, 2024	July 2024	August 31, 2024
1.26	December 9, 2022	April 11, 2023	April 14, 2023	February 28, 2024	June 2024	June 30, 2024
1.25	August 23, 2022	February 22, 2023	December 8, 2022	October 28, 2023	May 2024	February 29, 2024
1.24	May 3, 2022	November 15, 2022	August 19, 2022	July 28, 2023	January 31, 2024	October 31, 2023
1.23	December 7, 2021	August 11, 2022	April 29, 2022	February 28, 2023	October 11, 2023	July 31, 2023

That, then, translates to the following elapsed time, in months, from the upstream release:

	k8s-1.23	k8s-1.24	k8s-1.25	k8s-1.26	k8s-1.27
EKS Release	8	6	5	4	1
GKE Release	4	3	3	4	2
Upstream EOL	14	14	14	14	14
EKS EOL	22	20	21	18	15
GKE EOL	19	17	18	18	16

As you can see, the timespan between an upstream release and its adoption in EKS and GKE is somewhat variable. However, both EKS and GKE appear to aim to have releases available within 6 months of an upstream release being made. Both vendors now support a given release for 14 months from the date they first make it available, as does the upstream Kubernetes project.
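The elapsed-month figures can be reproduced from the dates in the first table. The sketch below assumes they are counted as whole calendar months (rounding down), which matches the rows we checked; the function name is hypothetical.

```python
from datetime import date


def whole_months(start: date, end: date) -> int:
    """Number of complete calendar months between two dates, rounded down."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    # If the day-of-month has not been reached yet, the last month is incomplete.
    if end.day < start.day:
        months -= 1
    return months


# Kubernetes 1.23: upstream release on December 7, 2021.
release_1_23 = date(2021, 12, 7)
print(whole_months(release_1_23, date(2022, 8, 11)))  # EKS release: 8
print(whole_months(release_1_23, date(2022, 4, 29)))  # GKE release: 4
print(whole_months(release_1_23, date(2023, 2, 28)))  # upstream EOL: 14

# Kubernetes 1.25: upstream release on August 23, 2022.
release_1_25 = date(2022, 8, 23)
print(whole_months(release_1_25, date(2023, 2, 22)))  # EKS release: 5
```

Note that this simple day-of-month comparison treats, say, January 31 to February 28 as zero complete months, which is fine for a rough comparison like this one.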

Recommendations

Weigh up the tradeoff between feature availability and dependency on tenants yourself. Some factors you may want to consider:
a. Number and scale of your clusters
b. Familiarity of your tenants with Kubernetes
c. Willingness/agreement to break tenant deployment pipelines
Specifically related to the last point above, consider adopting a contract that codifies expectations between you and your tenants when they onboard onto your platform. e.g. “We, as cluster operators, will provide you, the tenant, with x months’ notice to adapt to upstream Kubernetes API changes and provide support for those migrations. You, as a tenant, commit to adapting to such API changes within that time frame”. Having that agreement in place can be beneficial when escalating upgrade-related blockers with development teams.
Announce API deprecations and removals as early as possible. The Kubernetes Deprecated API Migration Guide is updated ahead of each upstream release. Keeping a close eye on that and letting tenants know of changes well ahead of your planned upgrade date will maximise the time they have to get any required code fixes in place.
Upgrade schedule: Bearing in mind all of the points above, and the various vendors’ release and support schedules, we currently recommend the upgrade schedule below. It is designed to maximise how long you remain in support on any given release, on the assumption that you have both turnkey clusters and cloud-provider managed clusters, as we currently do.

The schedule will provide for up to 10 months of running your “turnkey” clusters on an upstream-supported version of Kubernetes, and likely around 12-13 months of support on EKS & GKE clusters by the time you reach production depending on your own rollout timescales.

Timescale	Action
Continuous	Monitor upstream Changelogs, blog posts, and deprecation guides for API deprecation and removal announcements. Cross-check with `kube-no-trouble` and coordinate with tenants to remove usage of those APIs.
Between "Upstream release -2 months" and "Upstream release +3 months"	Ensure all upgrade prerequisites are met in production clusters.
Upstream release +4 months (depending on EKS + GKE availability)	Upgrade pre-dev clusters to upstream release.
Between "Upstream release +4 months" and "Upstream release +6 months"	Upgrade dev, test and production clusters to upstream release.
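To turn the relative timescales in the schedule into calendar dates for a specific release, something like the following sketch may help. The month offsets come straight from the table; the function names and the phase labels used as dictionary keys are hypothetical, and the pre-dev date still depends on EKS and GKE availability.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def upgrade_plan(upstream_release: date) -> dict[str, tuple[date, date]]:
    """Map each phase of the schedule to a (start, end) date window."""
    return {
        "prerequisites": (add_months(upstream_release, -2), add_months(upstream_release, 3)),
        "pre-dev": (add_months(upstream_release, 4), add_months(upstream_release, 4)),
        "dev-test-prod": (add_months(upstream_release, 4), add_months(upstream_release, 6)),
    }


# Kubernetes 1.27 was released upstream on April 11, 2023.
plan = upgrade_plan(date(2023, 4, 11))
for phase, (start, end) in plan.items():
    print(f"{phase}: {start.isoformat()} to {end.isoformat()}")
```

For 1.27 this puts production upgrades between August and October 2023, leaving roughly 8 to 10 months before the upstream end of support in June 2024.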

Summary

Keeping multi-tenanted Kubernetes clusters up to date with upstream releases in order to obtain new features, bug fixes and security fixes can be a daunting task, especially given how frequent those releases are. However, with a clear understanding of commitments on the part of both tenant and cluster administration teams, and upgrade schedules that align with upstream’s release cycles, we believe that such upgrades can and should be treated as business-as-usual activities rather than the fear-inducing major projects that they can sometimes feel like.