
ArgoCD is a popular Kubernetes-based declarative deployment tool, and for good reason! It has a polished Web and CLI-based UI, both supported by a general ArgoCD API and a flexible RBAC model. It’s designed to get you going quickly, leveraging familiar tooling such as Helm or Kustomize to manage the Kubernetes manifests you deploy to your clusters.
flowchart LR
  subgraph NS_ARGOCD["<code>Namespace</code> **argocd**"]
    ArgoCD[ArgoCD]
    ARGO_APP1["<code>Application</code><br>**app-1**"]
    ARGO_APP2["<code>Application</code><br>**app-2**"]
  end
  subgraph NS_APP1["<code>Namespace</code><br>**app-1**"]
    APP1[k8s<br>Manifests]
  end
  subgraph NS_APP2["<code>Namespace</code><br>**app-2**"]
    APP2[k8s<br>Manifests]
  end
  ArgoCD --reconciles--> ARGO_APP1
  ArgoCD --reconciles--> ARGO_APP2
  ARGO_APP1 --defines<br>via helm--> APP1
  ARGO_APP2 --defines<br>via kustomize--> APP2
  ArgoCD -.manages.-> APP1
  ArgoCD -.manages.-> APP2
A typical ArgoCD deployment, reconciling multiple Application objects and the resources they define (via Helm and Kustomize) in different namespaces.
In our experience across multiple clients, we’ve seen some pitfalls that are easily fallen into when introducing ArgoCD, especially when it comes to multi-tenanted platforms. At CECG, whenever we build a platform, introduce a technology, or add a new capability, we have a set of Core Principles that guide our design and implementation, including:
- Tenant Autonomy – Tenants should not be blocked by approvals from Platform Operators unless strictly required. They should have as much autonomy as possible within their owned areas.
- Tenant Isolation – Tenants should never be able to impact other tenants by exercising autonomy within their owned areas.
- Automated Progressive Continuous Delivery – All releases, by tenants and platform operators, should progressively roll out to validate and gain confidence before being deployed in front of end-users.
With these principles in mind, we can discuss the lessons we’ve learned when leveraging ArgoCD and how we navigated the pitfalls.
Tenant Autonomy and Isolation
Often, we see ArgoCD deployed and configured in a way that actively prevents tenants from managing Application and ApplicationSet objects themselves, instead requiring the coordinated use of shared namespaces, typically via an approval-based process. This creates a bottleneck dependency on a central team and breaks tenants’ autonomy. It also limits how far the operation of a core platform can scale, where a lean core platform team should ideally be able to support hundreds of teams and thousands of applications.
flowchart LR
  subgraph TENANTS["Tenant Teams"]
    T1["Team 1"]
    T2["Team 2"]
    TX["..."]
    TN["Team N"]
  end
  R["Shared Namespace<br>Owner"]
  T1 --<code>Application</code><br>Pull Requests--> R
  T2 --<code>Application</code><br>Pull Requests--> R
  TX --<code>Application</code><br>Pull Requests--> R
  TN --<code>Application</code><br>Pull Requests--> R
  R --Responsible for<br>merges into--> REPO["Application Repository"]
A model where shared usage of a namespace is coordinated by a single team creates a bottleneck, limiting how many tenants can be supported.
Due to the way ArgoCD’s RBAC model works, there is a knock-on security issue: all Application objects in a namespace are freely able to acquire the same permissions as any other Application in that namespace. These permissions include which namespaces resources can be managed in and which types of Kubernetes resource can be managed. This breaks tenant isolation: two tenants placing Applications in the same namespace could acquire each other’s permissions, unless whoever approves changes tracks exactly which permissions each tenant should hold on every change.
In our multi-tenanted Kubernetes-based Core Platforms, we typically map a Tenancy to a set of owned Namespaces, and tenants are given full control over namespace-scoped Kubernetes objects within those namespaces. Later in this article, we’ll capitalize on this setup to avoid the aforementioned isolation and autonomy issues.
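As a minimal sketch of what that ownership can look like in Kubernetes RBAC (the tenant group name is a hypothetical placeholder), each owned namespace gets a RoleBinding granting the built-in admin ClusterRole:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-admin
  namespace: a1                  # repeated for each namespace the tenant owns
subjects:
  - kind: Group
    name: tenant-a               # hypothetical tenant group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin                    # built-in role granting full namespace-scoped control
  apiGroup: rbac.authorization.k8s.io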
Continuous Deployment vs Delivery
The keen-eyed amongst you may notice that I did not introduce ArgoCD as a GitOps or Continuous Delivery tool specifically, opting to class it as a Kubernetes-based declarative Deployment tool. The reasons for this are:
- ArgoCD, by itself, gives you continuous reconciliation of a single thing you want to ensure is deployed to some cluster, not a full end-to-end delivery with validation and promotions.
- ArgoCD can be leveraged entirely without any Git repos being involved.
This is not to diminish ArgoCD in any way; it is very capable within its intended functionality, and the usage of Git is simply one mode of operation it supports amongst many useful options. To achieve end-to-end Continuous Delivery with automated, progressive rollout, we need additional orchestration. This orchestration signals when ArgoCD should update a given deployment in a given cluster in the context of a rollout across multiple deployments towards its end users.
A common approach we see is introducing ArgoCD and having automated pipelines finish by updating a single Application version after a successful merge, which then immediately rolls out to all clusters (either via a GitOps repo, or via an API apply). The takeaway should be that ArgoCD alone will not give you Continuous Delivery, but it does provide powerful deployment functionality as part of a Continuous Delivery end-to-end.
Now that we’ve outlined the principles we use to approach Platform Engineering and identified some of the pitfalls we’ve seen come up, we can leverage the tools ArgoCD has available to solve them!
Recommendations for Configuring ArgoCD
This brings us on to the final pieces of the picture:
- Leveraging AppProjects for multi-tenancy with isolation and autonomy
- Configuring ArgoCD to watch Application and ApplicationSet objects in all namespaces
- Locking down the default project
Argo Projects to Provide Isolation and Autonomy
In ArgoCD, when you create an Application object, it is always associated with an ArgoCD Project (an AppProject object). If you specify no project (or an invalid one), your application will be assigned to the default project. ArgoCD projects allow you to specify a number of interesting constraints on their associated applications, including:
- A list of namespaces the Application objects must reside in
- A list of namespaces and clusters the Application objects can manage resources in
- An allow/blocklist of namespace-scoped resources that Application objects can manage
- An allowlist of cluster-scoped resources that Application objects can manage
Given we typically capture a tenant as owning a set of namespaces (and a namespace is never actively owned by more than one tenant), we can directly map this into the creation of one AppProject per tenant which:
- Allows Application (and ApplicationSet) objects to reside in any namespace owned by the tenant
- Allows Application objects to manage Kubernetes resources in any namespace owned by the tenant
- Allows the tenant team to manage only permitted namespace- and cluster-scoped Kubernetes objects
The image below shows an example of what this might look like with two tenants (“Tenant A” and “Tenant B”) each with two namespaces (“A1”, “A2”, “B1”, “B2”):
flowchart TD
  subgraph ARGOCD[ArgoCD Namespace]
    direction LR
    PROJECT_A[Tenant A<br>AppProject]
    PROJECT_B[Tenant B<br>AppProject]
  end
  A[Tenant A]
  subgraph NS_A_1[Namespace A1]
    APP_A1[Application]
  end
  subgraph NS_A_2[Namespace A2]
    APP_A2[Application]
  end
  A -- owns --> NS_A_1
  A -- owns --> NS_A_2
  APP_A1 -- belongs to --> PROJECT_A
  APP_A2 -- belongs to --> PROJECT_A
  B[Tenant B]
  subgraph NS_B_1[Namespace B1]
    APP_B1[Application]
  end
  subgraph NS_B_2[Namespace B2]
    APP_B2[Application]
  end
  B -- owns --> NS_B_1
  B -- owns --> NS_B_2
  APP_B1 -- belongs to --> PROJECT_B
  APP_B2 -- belongs to --> PROJECT_B
Multiple Applications referencing the respective AppProject for their tenancy.
In this scenario:
- The Tenant A AppProject lists namespaces A1 and A2 as the permitted namespaces for Applications to reside in, and for Kubernetes manifests to be managed in
- The Tenant B AppProject lists namespaces B1 and B2 as the permitted namespaces for Applications to reside in, and for Kubernetes manifests to be managed in
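As a concrete sketch, the Tenant A AppProject could look like the following (namespace and server values are illustrative, and sourceRepos would typically be tightened to the tenant’s own repositories):
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: tenant-a
  namespace: argocd
spec:
  # Namespaces in which the tenant's Application/ApplicationSet objects may reside
  sourceNamespaces:
    - a1
    - a2
  # Namespaces (and clusters) in which those Applications may manage resources
  destinations:
    - server: https://kubernetes.default.svc
      namespace: a1
    - server: https://kubernetes.default.svc
      namespace: a2
  # Sources the tenant may deploy from
  sourceRepos:
    - "*"
  # No cluster-scoped resources permitted unless explicitly listed
  clusterResourceWhitelist: []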
Assuming Tenants A and B can only create Kubernetes objects in their respective namespaces (enforced by Kubernetes RBAC), any Application object created in those namespaces must reference an AppProject which accepts applications from that namespace and only tries to manage resources in the tenant’s namespaces.
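For illustration, a valid Application for Tenant A might look like this (the application name, chart source, and version are hypothetical):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-a1
  namespace: a1              # resides in a tenant-owned namespace
spec:
  project: tenant-a          # references the tenant's AppProject
  source:
    repoURL: https://github.com/acme/app-a1   # hypothetical chart repository
    targetRevision: 1.2.3
    path: chart
  destination:
    server: https://kubernetes.default.svc
    namespace: a1            # manages resources only in a tenant-owned namespace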
If an Application in namespace A1 attempts to reference the Tenant B AppProject, ArgoCD will see that the Tenant B project only accepts Application objects residing in the B1 or B2 namespaces, and will report an error; it will not even attempt a deployment, as doing so would violate the permitted namespace configuration.
Each tenant’s AppProject must be kept up to date with the dynamic list of namespaces they own. We will tackle this problem later in the A new problem: keeping Tenant namespaces up to date section.
Watching Applications in all Namespaces
Configuring ArgoCD to watch all namespaces depends on how you install it. If you’re using the Helm chart, you can pass in some extra configuration via the chart values:
configs:
  params:
    # -- Configure the namespaces the AppSet and App controllers watch
    applicationsetcontroller.namespaces: "*"
    application.namespaces: "*"
applicationSet:
  # -- Create cluster role bindings allowing AppSet controller to read from all namespaces
  allowAnyNamespace: true
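Assuming these values are saved as values.yaml and you are using the community argo-cd chart from the Argo Helm repository, the install/upgrade might look like:
# Add the Argo Helm repository and apply the values above
helm repo add argo https://argoproj.github.io/argo-helm
helm upgrade --install argocd argo/argo-cd \
  --namespace argocd --create-namespace \
  --values values.yaml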
Locking down the default project
Earlier, we mentioned that all Applications must reference an AppProject, and that if the reference is missing or invalid, the default project will be used. This default project is created by ArgoCD on first startup if one does not already exist. The default project is configured to:
- Allow Applications to reside in any namespace
- Allow Applications to manage resources in any namespace in any cluster
- Allow Applications to manage all namespace-scoped resources
- Allow Applications to manage all cluster-scoped resources
This is a problem, as it allows anybody to immediately escape the confines of their tenant-based AppProject and use ArgoCD to manage any resource it has permission to. Thankfully, you can override this default project, and ArgoCD will respect the changes because it only creates the default project when one is missing. We update the default project to the following:
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  clusterResourceWhitelist: []
  destinations: []
  sourceRepos: []
  sourceNamespaces: []
This sets the list of source and destination namespaces to an empty list (i.e., no namespaces permitted). Now, any Application which ends up referencing the default project will be flagged as violating the permitted namespaces, and ArgoCD will refuse to even try to synchronize its resources.
ArgoCD as part of a Continuous Delivery flow
Now that we have a way to configure ArgoCD for multi-tenant operation, with autonomy and isolation guarantees, we can think about how to build an end-to-end Continuous Delivery flow incorporating ArgoCD as the deployment tool. For this, we assume the following are in place:
- Each Tenant has their own set of automated pipelines (e.g., Jenkins, GitHub Actions, Tekton)
- These pipelines have access to the namespaces owned by the Tenant and can authenticate and authorize to manage them
- These pipelines have access to an OCI registry to be able to publish container images and Helm charts
- All Kubernetes clusters have access to pull container images and Helm charts from the OCI registry
With these in place, we can imagine a simple Continuous Delivery flow like the following:
- PR raised, tested, approved, and merged
- Automated pipeline publishes a new version of the container images and Helm charts (e.g., v2.0.0)
- On demand, or at a regular interval, the most recent version $v is deployed to **dev** as an ArgoCD Application
- When the application is **synchronized** and healthy, launch a test suite against stubbed dependencies
- Wait for all tests to pass, optionally wait to see if any alerts are fired
- If everything is OK: immediately, or after a configured delay, promote version $v to integration as an ArgoCD Application
- When the application is synchronized and healthy, launch a test suite against real dependencies
- Wait for all tests to pass, optionally wait to see if any alerts are fired
- If everything is OK: immediately, or after a configured delay, promote version $v to canary as an ArgoCD Application
- When the application is synchronized and healthy, wait to see if any alerts are fired
- If everything is OK: immediately, or after a configured delay, promote version $v to production as an ArgoCD Application
- When the application is synchronized and healthy, $v is successfully rolled out
In CECG we call this a **Path to Production** (P2P), and we use similar approaches to manage both Infrastructure and Workloads. You’ll notice that the way ArgoCD is integrated into this end-to-end follows a repeated, identical pattern:
- Deploy version $v to some environment $e by updating its Application
- Wait for the Application to be **synchronized** and healthy
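As a minimal sketch of that repeated pattern in a pipeline (the application and namespace names are illustrative, and we assume kubectl v1.23+ for jsonpath-based waits):
#!/usr/bin/env bash
set -euo pipefail

# Promote version $2 to environment $1 by patching its Application,
# then wait for ArgoCD to report it Synced and Healthy.
promote() {
  local env="$1" version="$2"
  kubectl patch application "my-app-${env}" -n tenant-a --type=merge \
    -p "{\"spec\":{\"source\":{\"targetRevision\":\"${version}\"}}}"
  # A robust pipeline would also verify .status.sync.revision matches ${version}
  kubectl wait "application/my-app-${env}" -n tenant-a \
    --for=jsonpath='{.status.sync.status}'=Synced --timeout=5m
  kubectl wait "application/my-app-${env}" -n tenant-a \
    --for=jsonpath='{.status.health.status}'=Healthy --timeout=10m
}

promote dev v2.0.0
# ...run tests and check alerts, then:
promote integration v2.0.0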
Deploy a version to an environment
Deploying a specific version of your application in this scenario involves updating the spec.source.targetRevision property of an Application object to the new version you have just published. This triggers ArgoCD to pull the new version of your Helm chart and use it to render the Kubernetes manifests into your desired namespace(s), which also reference the new container image versions your pipeline has published.
The simplest way to trigger an update to your Application object is to have your pipeline interact directly with the Application object, either via kubectl or the argocd CLI, like this:
# Patch the desired version into an Application object
kubectl patch application <name> -n <namespace> --type='merge' -p '{"spec":{"source":{"targetRevision":"<new-version>"}}}'
# Update the entire Application object
kubectl apply -f application.yml
# Using the ArgoCD CLI
argocd app set <name> --app-namespace <namespace> --revision <new-version>
Any of these options will allow you to patch the targetRevision property directly, triggering a new deployment.
The alternative to directly managing these objects is to commit an update to your Application object to some GitOps repo which ArgoCD is configured to pull in your target clusters. This approach usually entails following what ArgoCD calls the “App of Apps” pattern, in which:
- All of your Application objects are stored in some repo like acme/gitops under a folder named something like apps/
- You create a “root” Application which automatically deploys all YAML files found under apps/ in the acme/gitops repo to your cluster (a sketch follows below)
- Any changes to the YAML files under the apps/ folder are pulled from the repo into the cluster by ArgoCD
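A sketch of such a “root” Application, reusing the hypothetical acme/gitops repo and apps/ folder from above:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: tenant-a             # an AppProject that permits this source and destination
  source:
    repoURL: https://github.com/acme/gitops
    targetRevision: main
    path: apps                  # folder containing the Application YAML files
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd           # where the child Application objects are created
  syncPolicy:
    automated:
      prune: true               # remove Applications whose files are deleted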
The main differences between directly managing Application objects and managing them via a Git repository are:
- Your pipelines need to be given additional permissions to commit to the GitOps repo
- Your pipelines need to factor in the extra time taken for a Git commit to be seen by ArgoCD and result in an update to the Application object on the cluster
For both approaches, the next step is to wait for an Application to be synchronized and healthy.
Application sync and health status
After an update is triggered to an Application object, we need to give ArgoCD time to do its deployments, and wait for any managed Kubernetes resources to reach a ready state. All of this information is contained in the status property of any Application object, particularly:
- The sync revision of an Application (.status.sync.revision) details the version of the Helm chart ArgoCD is trying to render
- The sync status of an Application (.status.sync.status) details the result of the attempt to use Helm to render Kubernetes resources, with a state of Synced, OutOfSync or Unknown
- The health status of an Application (.status.health.status) details the aggregated health of all Kubernetes resources managed by the current version of the application, with a state of Healthy, Degraded, Progressing, Suspended, Missing or Unknown
An Application only reports as Healthy once all of its managed resources (e.g., Deployment, Ingress, Pod) are healthy. Extensible mechanisms exist where you can configure ArgoCD with additional rules to assess the health of custom Kubernetes types; the ArgoCD docs provide guidance.
Exactly how you implement interrogation of these properties will depend on your chosen technologies/languages, but this gives an overview of how you can assess sync and health status as part of pipeline execution.
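For example, a pipeline step could read all three properties in one kubectl call (the application and namespace names are illustrative):
# Read the rendered revision, sync status, and health status of an Application
kubectl get application my-app -n tenant-a \
  -o jsonpath='{.status.sync.revision} {.status.sync.status} {.status.health.status}'
# Illustrative output: 2.0.0 Synced Healthy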
All of this together gives you a Path to Production where a version is progressively rolled out across environments and validated through different testing phases to gain confidence before it releases in front of end-users.
A new problem: keeping Tenant namespaces up to date
One issue introduced by the approach outlined in this post is that we now have a collection of AppProject objects that need to be kept up to date as tenant namespaces come and go. How frequently this needs to happen depends a lot on how much autonomy you’ve given tenants within your platform. For a variety of reasons, we have clients at both ends of the spectrum, from “every namespace must be manually approved” to “tenants can create as many namespaces as they want without approval.”
Where namespaces can come and go quite quickly, we identified that it would be useful to have a constant, reconciliation-based approach to identifying tenants and the namespaces they own, and using that to keep the AppProject objects up to date. To this end, we created an operator named argocd-tenant-project-manager which does the following:
- Scans for Namespaces that define a tenancy, based on label/annotation-based selectors (e.g., cecg.io/is-tenant: true)
- Extracts the name of the tenant from the matched namespace (using the name of the namespace, or the value of a label/annotation)
- Scans for Namespaces that belong to a tenancy, based on label/annotation-based selectors (e.g., cecg.io/tenant: tenant-A)
- Ensures an AppProject with the same name as the tenant exists, and is configured with all of the namespaces owned by that tenant
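For illustration, a tenancy-defining namespace and one of its owned namespaces might be labeled as follows (namespace names are hypothetical; the label keys are the example selectors above):
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
  labels:
    cecg.io/is-tenant: "true"   # marks this namespace as defining a tenancy
---
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a-dev
  labels:
    cecg.io/tenant: tenant-a    # marks this namespace as owned by tenant-a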
The controller was built using Kubebuilder and allows us to quickly leverage ArgoCD at our clients in line with our principles of tenant autonomy and isolation. Look out for a post with us open-sourcing the operator soon!
Summary
Alright, let’s wrap this up! First, we dove into using ArgoCD in a multi-tenanted platform. We learned that out of the box, ArgoCD’s default settings can be at odds with our principles of tenant autonomy and isolation.
To fix this, we looked at using ArgoCD Projects to create boundaries. Each tenant gets their own AppProject, which defines what namespaces they can deploy to and what resources they can manage. This keeps everyone in their own lane. We also made sure ArgoCD watches for applications in all namespaces and locked down the default project to prevent any sneaky escapes.
We also covered that while ArgoCD is great for deployments, it’s not a full Continuous Delivery solution on its own. We need to integrate it into our automated workflows, like Jenkins or GitHub Actions. That means triggering ArgoCD to deploy new versions to different environments (dev, integration, etc.) and waiting for it to sync and confirm everything’s healthy before moving on to the next stage.
Finally, we touched on keeping those AppProjects up to date. If you have tenants creating namespaces all the time, you’ll need a way to automatically reconcile and update the projects. That’s where our argocd-tenant-project-manager operator comes in, keeping everything in sync based on namespace labels and annotations.
In a nutshell, we’ve learned how to make ArgoCD play nice in a multi-tenanted setup, ensuring tenants have the freedom they need while staying nicely isolated. Our full CECG guidance on leveraging ArgoCD in your organization also covers topics beyond the scope of this post, including production installation and configuration, high availability, SSO integration and using ArgoCD’s custom RBAC.
If you’d like to know more about how we build Developer Platforms, Paths to Production, how we evaluate Key Technologies like ArgoCD, or how we could help you with anything Platform Engineering, reach out!