Identity-based Authentication for a Developer Platform

Author: Tomasz Bartosiewicz | Posted on: April 19, 2024



Alongside increasing developer productivity (which translates into more frequent releases) and improving developer experience, one of the key goals of a platform team is to provide architectural solutions and tools for the common, and often complex, problems that application developers encounter. One such challenge is the authentication and authorisation layer, which is required not only for the services that tenant teams develop but also for the platform’s internal applications.

Traditionally, internal services would only be exposed on an internal network. While this was regarded as the most secure way of exposing them, it increases networking complexity, creates a single point of failure and can have unwanted consequences, such as preventing users from accessing applications from a particular device or geographical location.

For public-facing services, a common approach is for teams to develop their own authentication layer. This very often results in duplicated and outdated authentication mechanisms scattered across the organisation, which pose high security risks and are hard to govern through security team policies.

It makes sense for the platform engineering team to encapsulate this critical functionality. Doing so not only ensures that the organisation’s services comply with security policy but also reduces the tenant teams’ delivery time when exposing public-facing and internal services (sometimes even by months!).

This post demonstrates our technical approach to providing such a feature. It allows us and our tenants to expose internal services using Google Identity-Aware Proxy (IAP), a service offered by our cloud provider, in one of our in-house developed platforms running on Google Kubernetes Engine (GKE).

This approach gives the security, compliance and engineering functions the outcome they are looking for: authentication can be controlled, managed and implemented consistently across the whole business with confidence.



Google Identity-Aware Proxy (IAP)

Google Cloud IAP is a built-in service that helps control access to applications and protect them from unauthorised users. It acts as a proxy that sits in front of your applications and verifies the identity of users before granting them access. Identities can come from the Google Workspace that is part of your organisation as well as from external identity providers by leveraging Identity Platform. IAP is a key component of Google’s BeyondCorp security model: it enables organisations to control access to their internal applications based on user identity and context rather than network location. For internal services, it also significantly improves user experience by providing single sign-on out of the box.

Incorporating IAP for internal services into the developer platform brings several benefits, such as:

  • abstracting away the engineering effort teams would otherwise spend implementing complex authentication and authorisation processes
  • providing out-of-the-box identity-based access management
  • allowing tenant teams to rapidly and securely expose services on the internet in a way that is compliant with organisation security policies from day one


Objectives

Apart from ensuring that internal services exposed on public endpoints cannot be accessed by unauthorised users, common objectives when designing authenticated endpoints on external ingress are:

  • avoid unnecessary networking complexity (tunnelling/port forwarding) where possible, to simplify the overall network architecture
  • provide an easy-to-use tenant interface, with exposure of a service driven seamlessly from the tenant’s environment configuration
  • ensure a single, easy-to-use place to manage users and groups
  • ensure scheduled teardown and recreation of platform environments remains feasible


Our current ingress architecture

To recap, our current architecture consists of several components:

  • Global External L7 Load Balancer (GLB) acting as platform entry point
  • Cert Manager to manage the lifecycle of the certificates for that load balancer
  • Let’s Encrypt to issue the certificates dynamically
  • External DNS to dynamically create the DNS records for all the tenant applications
  • Traefik as an internal load balancer

For a diagram and detailed explanation of the architecture please see our previous Multi-Tenant Ingress for a GKE-based Developer Platform blog.

To spare teams the often painful and lengthy process of obtaining a domain and the corresponding certificates, our platform provides a common domain (e.g. developer-platform.cecg.io) out of the box that tenants’ exposed services can use. Tenants are only required to define their service under the platform’s domain in their environment configuration. For example, a service named “myapp” would live under the domain myapp.developer-platform.cecg.io. The deployment mechanism then translates this definition into a Kubernetes Ingress object and deploys it to the tenant environment. This approach allows us to control the crucial infrastructure required to expose the network endpoint while providing teams with a simple interface for rapidly exposing their services.
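
The translation from a tenant's service definition to an Ingress object can be sketched roughly as follows. This is a minimal illustration rather than our actual deployment mechanism: the function name `build_ingress`, the namespace `myapp-team`, the `traefik` ingress class name and the backend port are all assumptions made for the example. Only the platform domain and the "myapp" service name come from the text above.

```python
import json

PLATFORM_DOMAIN = "developer-platform.cecg.io"  # common platform domain


def build_ingress(service_name: str, namespace: str, port: int = 80,
                  ingress_class: str = "traefik") -> dict:
    """Build a networking.k8s.io/v1 Ingress manifest for a tenant service.

    The ingress class name and the port are assumed values for illustration.
    """
    host = f"{service_name}.{PLATFORM_DOMAIN}"
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": service_name, "namespace": namespace},
        "spec": {
            "ingressClassName": ingress_class,
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    # Route everything under the host to the tenant's Service.
                    "backend": {"service": {"name": service_name,
                                            "port": {"number": port}}},
                }]},
            }],
        },
    }


if __name__ == "__main__":
    # JSON is printed here because kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(build_ingress("myapp", namespace="myapp-team"), indent=2))
```

The point of the sketch is that the tenant supplies only a service name; the host, the ingress class and the routing rule are derived by the platform.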



How our network stack is configured

Our external load balancer is configured via the Gateway API and managed by the GKE Gateway Controller. A listener is registered with a hostname that matches all requests (via the Host header) under our platform’s public domain, developer-platform.cecg.io. Our cluster uses Cert Manager and ExternalDNS to manage the certificates and DNS records for all our services, resulting in the creation of the myapp.developer-platform.cecg.io DNS record and the service-specific myapp.developer-platform certificate. The Global Load Balancer (GLB) routes are configured via an HTTPRoute object that specifies how HTTP/S requests are routed from the listener to our internal load balancer (the Traefik Service). Traefik is in turn configured via the Ingress API with the specific tenant target service to forward the request to.
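
As a rough sketch of how these two objects relate, the snippet below builds a Gateway with a single wildcard listener and an HTTPRoute that attaches to it and forwards to a Traefik Service. The resource names, the namespace, the listener name, the Traefik Service name and port, and the single TLS secret are hypothetical and chosen for brevity (the platform described above manages service-specific certificates). The `gateway.networking.k8s.io/v1` API version and the `gke-l7-global-external-managed` GatewayClass are assumptions about the cluster setup.

```python
import json

PLATFORM_DOMAIN = "developer-platform.cecg.io"


def build_gateway(name: str, namespace: str, cert_secret: str) -> dict:
    """Gateway with one HTTPS listener matching hosts under the platform domain."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "gatewayClassName": "gke-l7-global-external-managed",  # assumed class
            "listeners": [{
                "name": "platform-https",
                "protocol": "HTTPS",
                "port": 443,
                "hostname": f"*.{PLATFORM_DOMAIN}",
                "tls": {"mode": "Terminate",
                        "certificateRefs": [{"name": cert_secret}]},
            }],
        },
    }


def build_http_route(name: str, namespace: str, gateway_name: str,
                     listener_name: str, backend_service: str,
                     backend_port: int = 80) -> dict:
    """HTTPRoute that sends all traffic from one listener to a backend Service."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # sectionName pins the route to a specific listener on the Gateway.
            "parentRefs": [{"name": gateway_name, "sectionName": listener_name}],
            "rules": [{"backendRefs": [{"name": backend_service,
                                        "port": backend_port}]}],
        },
    }


if __name__ == "__main__":
    gateway = build_gateway("external-gateway", "platform-ingress", "platform-tls")
    route = build_http_route("to-traefik", "platform-ingress",
                             gateway["metadata"]["name"],
                             gateway["spec"]["listeners"][0]["name"], "traefik")
    print(json.dumps([gateway, route], indent=2))
```

The Gateway only decides which hostnames enter the platform; per-tenant routing remains in the Ingress objects that Traefik consumes.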



Our approach

Following our proven methodology for tackling major architectural decisions, we thoroughly investigated which technologies could help us solve this challenge and fit our current architecture, a process that always results in a new Architecture Decision Record (ADR). At the time of writing, we had identified two main contenders: OAuth2 Proxy and GCP IAP. As we run our platform on GCP, we picked GCP IAP, mainly because:

  • It is natively integrated, so there is no need to deploy or maintain new components
  • It satisfies all of our identity requirements for authentication: support for users, groups and service accounts
  • It can be extended to support external identities in the future
  • It is managed by GCP and backed by Service Level Agreements (SLAs)

OAuth2 Proxy provides similar features; however, we would have to run additional components, which would add complexity and cost, and leave us maintaining a single-point-of-failure component within the platform.



Implementation

To limit the cost and burden of maintaining multiple external ingresses while still ensuring satisfactory segregation between internal and external traffic, we reused the current external load balancer and expanded its network endpoint configuration to match two distinct paths. To enable request host matching at the GLB level, we added a listener to our Gateway object that is configured to match hostnames on the platform’s new, pre-defined secure subdomain: secure.developer-platform.cecg.io.
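
Adding the second listener can be pictured as a small, non-destructive change to the existing Gateway manifest. The sketch below is illustrative only: the listener names, the wildcard form of the secure hostname, the TLS secret names and the trimmed-down example Gateway are assumptions, and how the controller resolves the overlap between the two wildcard hostnames is not modelled here.

```python
import copy
import json

SECURE_DOMAIN = "secure.developer-platform.cecg.io"


def add_secure_listener(gateway: dict, cert_secret: str) -> dict:
    """Return a copy of the Gateway with an extra listener for the secure subdomain."""
    updated = copy.deepcopy(gateway)  # leave the original manifest untouched
    updated["spec"]["listeners"].append({
        "name": "secure-https",              # hypothetical listener name
        "protocol": "HTTPS",
        "port": 443,
        "hostname": f"*.{SECURE_DOMAIN}",    # assumed wildcard under the subdomain
        "tls": {"mode": "Terminate",
                "certificateRefs": [{"name": cert_secret}]},
    })
    return updated


# A trimmed-down stand-in for the existing Gateway (hypothetical names).
existing_gateway = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "Gateway",
    "metadata": {"name": "external-gateway", "namespace": "platform-ingress"},
    "spec": {
        "gatewayClassName": "gke-l7-global-external-managed",
        "listeners": [{
            "name": "platform-https",
            "protocol": "HTTPS",
            "port": 443,
            "hostname": "*.developer-platform.cecg.io",
            "tls": {"mode": "Terminate",
                    "certificateRefs": [{"name": "platform-tls"}]},
        }],
    },
}

if __name__ == "__main__":
    secured = add_secure_listener(existing_gateway, "secure-platform-tls")
    print(json.dumps(secured["spec"]["listeners"], indent=2))
```

An HTTPRoute can then target the new listener by name through `sectionName` in its `parentRefs`, which keeps authenticated and unauthenticated routes separate while sharing one load balancer.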

By default, the GKE Gateway Controller creates a load balancer with a default configuration that can be modified further through Policies. A GCPBackendPolicy allows GLB backends to be configured with IAP, which in turn enforces the access control policies for the associated HTTPRoute configuration. To enable authenticated paths, we deployed a new Traefik Service that is targeted by an IAP-enabled route. To maximise reuse, we configured this new Service to forward traffic to the existing Traefik pods. From there, traffic is forwarded according to the tenant’s Ingress configuration, which routes it to distinct target services.
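
A minimal sketch of the two objects involved is shown below: a second Service selecting the existing Traefik pods, and a GCPBackendPolicy that switches IAP on for that Service. The Service name, namespace, pod labels, target port name, OAuth client ID and Secret name are all placeholders invented for the example. The policy layout follows the `networking.gke.io/v1` GCPBackendPolicy schema as we understand it, so it should be checked against the GKE documentation for your cluster version.

```python
import json


def build_iap_service(name: str, namespace: str, traefik_selector: dict,
                      target_port: str = "web") -> dict:
    """Second Service pointing at the existing Traefik pods via the same labels."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "ClusterIP",
            "selector": dict(traefik_selector),  # same pods as the existing Service
            "ports": [{"name": "web", "port": 80, "targetPort": target_port}],
        },
    }


def build_iap_backend_policy(service_name: str, namespace: str,
                             client_id: str, secret_name: str) -> dict:
    """GCPBackendPolicy enabling IAP on the backend created for the Service."""
    return {
        "apiVersion": "networking.gke.io/v1",
        "kind": "GCPBackendPolicy",
        "metadata": {"name": f"{service_name}-iap", "namespace": namespace},
        "spec": {
            "default": {
                "iap": {
                    "enabled": True,
                    # Kubernetes Secret holding the OAuth client secret.
                    "oauth2ClientSecret": {"name": secret_name},
                    "clientID": client_id,
                },
            },
            # The policy attaches to a Service in the same namespace.
            "targetRef": {"group": "", "kind": "Service", "name": service_name},
        },
    }


if __name__ == "__main__":
    service = build_iap_service(
        "traefik-secure", "platform-ingress",
        traefik_selector={"app.kubernetes.io/name": "traefik"},  # assumed labels
    )
    policy = build_iap_backend_policy(
        "traefik-secure", "platform-ingress",
        client_id="EXAMPLE_CLIENT_ID",           # placeholder, not a real ID
        secret_name="iap-oauth-client-secret",   # hypothetical Secret name
    )
    print(json.dumps([service, policy], indent=2))
```

Because the policy is attached to the Service rather than to the pods, only traffic arriving through the IAP-enabled route is authenticated by IAP, while the same Traefik pods continue to serve the existing route unchanged.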



Considerations

While reusing platform components reduces development effort, cost and maintenance burden, it does increase the blast radius of a component failure. It also increases upgrade complexity, as any upgrade or update to the components exposing the secure endpoint affects both endpoints. Furthermore, given the interdependency of our network components, if there is an issue obtaining a certificate (from Let’s Encrypt) for any endpoint when re-deploying the environment, our external load balancer will not be created. This means that an issue with DNS affects all services, including external ones. To mitigate this, it is worth considering the deployment of a completely segregated stack, from a separate external GLB (through the GatewayClass object) all the way to a dedicated internal Traefik instance.

Due to the nature of our platform and its current target audience, we went with a simple approach for now with a plan to expand to an isolated network model in the future.



IAP limitations

At the time of writing, GCP IAP is far from perfect. Enabling IAP requires creating a so-called Brand, a resource needed to manage the user-facing customisations of the Identity-Aware Proxy’s OAuth consent screen. Unfortunately, the Brand API is quite limited: there can be only a single Brand per project and, once created, the Brand cannot be deleted. This proves problematic for scheduled environment rebuilds that are fully managed by Terraform, as the initial creation succeeds but teardown does not delete the resource (unless you are planning to destroy the whole GCP project!). To reflect this one-to-one dependency between project and Brand, we incorporated Brand creation into our initial GCP organisation bootstrap, so that GCP projects are created with a Brand pre-populated. The Brand reference is static, meaning it does not change after initial creation, so it can be hardcoded wherever a reference to the Brand is required.
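
To illustrate why a hardcoded reference is workable, the helper below derives a Brand reference purely from project-level identifiers. It assumes the `projects/<project-number>/brands/<brand-id>` resource-name shape used by the IAP API. The function name and the example number are hypothetical, and the assumption that the brand ID equals the project number should be verified for your own project.

```python
def brand_reference(project_number: str, brand_id: str) -> str:
    """Build a Brand resource name from identifiers that survive environment rebuilds."""
    if not project_number.isdigit():
        raise ValueError("project_number must be the numeric project number")
    return f"projects/{project_number}/brands/{brand_id}"


if __name__ == "__main__":
    # Placeholder project number; the brand ID is assumed to match it.
    print(brand_reference("123456789012", "123456789012"))
```

Since nothing in the reference depends on the environment being rebuilt, the value can live in static configuration created during the bootstrap.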



Summary

Providing authentication features can be a complex and lengthy task. In addition, a custom implementation of such a feature at the platform level requires constant re-evaluation and careful management to ensure the ongoing security of an organisation’s assets. By using available cloud offerings, we can significantly reduce the development time of such features while relying on proven, well-tested solutions that are maintained by the provider.