For this engagement we built an MVP developer platform, based on Kubernetes, in a short timeframe (3 months) with 2 engineers. The goal was to get a small number of initial engineering teams' applications live.
To do that we needed to be very careful about scope:
- adding only the features required to get the initial applications live
- future-proofing the platform for increased adoption without disruption to the initial tenants
- ensuring the platform meets the reliability requirements of the initial applications
A key capability is knowing whether the platform is working, capturing problems before the platform’s customers do. For that we need:
- A clear boundary of what the platform manages, and what the tenants manage (what AWS calls the “shared responsibility model”). What the platform manages should be covered by SLOs & SLIs.
- Use the user journey as the north star
Context and Problem Statement
We believe the best way to do this is to define SLOs & SLIs for the platform and derive alerts from them. By doing so:
- All stakeholders of the platform get a consistent, measurable expectation of what the platform delivers, from the SLI & SLO definitions.
- When the SLOs are at risk or not being met, alerts can be triggered to notify the platform team to take action.
- The SLOs & SLIs must be agreed with the business
- The SLOs & SLIs must be measurable
- The alert definition must be based on SLOs & SLIs
- The alert must be actionable
- Having the right number of SLIs & SLOs - too many causes attention fatigue, whereas too few leaves gaps in coverage.
- Exercised in a structured manner - drive the expectation using SLOs, measuring using SLIs, and implementing using metrics & alerts.
- Perfect is the enemy of good - implement the SLOs & SLIs well enough and improve over time.
- It’s not a closed-door exercise - make sure it’s agreed with the business.
- Alerts must be actionable - if the alert is not actionable, it’s not an alert, it’s a notification.
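These principles can be made concrete by treating each SLO as a small piece of structured data from which alerts are later derived. The sketch below is illustrative only - the class and field names are our own invention, not an existing library, and the figures are examples rather than agreed targets:

```python
from dataclasses import dataclass


@dataclass
class SLO:
    name: str          # human-readable name, agreed with the business
    sli: str           # how the SLI is measured (in practice, a metrics query)
    objective: float   # target, e.g. 0.999 means 99.9% of events must be good
    window_days: int   # rolling window the objective applies over

    def error_budget(self) -> float:
        # The error budget is simply the allowed proportion of bad events.
        return 1.0 - self.objective


# Hypothetical example in the shape of the control-plane availability SLO:
api_availability = SLO(
    name="control-plane-availability",
    sli="proportion of non-5xx HTTP responses from the k8s API server",
    objective=0.999,
    window_days=30,
)
```

Keeping the definition in one structured place means the alert rules can be generated from it, rather than drifting apart from the business-agreed objective.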
- As the tenant of the platform I want to deploy and onboard my app to the platform via the platform control plane. The large majority of API requests to the control plane are successful, assuming the request is valid. (from an availability perspective)
- As the tenant of the platform I want computing resources requested by my app to be scheduled and available in a timely manner (from a latency perspective), assuming:
  - the app is functional
  - the resources requested are reasonable
| Category | SLI | SLO |
|----------|-----|-----|
| Availability | The proportion of HTTP requests processed successfully by the platform k8s API server, measured from the k8s API server scraping endpoint. Any HTTP status other than 5xx is considered successful. | x% over a 30 day rolling window |
| Availability | Consecutive minutes where the API server is not accessible by the monitoring system. The API server is considered accessible when the API server endpoint can be hit from a health check probe running inside the cluster. | consecutive minutes <= X |
| Latency | Startup latency of schedulable stateless pods, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as the 99th percentile over the last 5 minutes. | 99th percentile per cluster-day <= 5s |
- This might be tricky to measure. Another option is to measure “startup latency of schedulable pods, measured from pod creation timestamp to when the pod is marked as scheduled, measured as the 99th percentile over the last 5 minutes”.
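In practice the availability SLI above would be computed as a metrics query over the API server’s request counters; the toy function below just illustrates the definition (any status other than 5xx counts as success), with made-up request counts:

```python
def availability_sli(status_counts: dict[int, int]) -> float:
    """Proportion of successful requests, given a map of
    HTTP status code -> request count. Any status other than
    5xx is considered successful."""
    total = sum(status_counts.values())
    failed = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return (total - failed) / total


# Illustrative counts: 10,000 requests, 20 of which were 5xx.
sli = availability_sli({200: 9950, 404: 30, 500: 15, 503: 5})  # 0.998
```

Note that 4xx responses count as successes here: an invalid request rejected by the API server is the tenant’s responsibility under the shared responsibility model, not a platform failure.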
- As the tenant of the platform I want to make sure that there’s enough capacity on the platform to schedule my workload
| Category | SLI | SLO |
|----------|-----|-----|
| Availability | The proportion of CPU available on the cluster | 1 node’s worth of CPU available |
| Availability | The proportion of memory available on the cluster | 1 node’s worth of memory available |
- Tenants are expected to request enough resources to sustain a single node outage as part of their deployment
- Keeping 1 node’s worth of resources as a buffer is only strictly required for (quick) autoscaling, but we keep it in our case to provide a minimum of zonal redundancy given the small number of tenants (and therefore nodes) we have.
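The capacity SLIs above reduce to a simple comparison: is at least one node’s worth of allocatable resource still unrequested cluster-wide? A minimal sketch, with illustrative figures (in real life these values would come from the scheduler’s allocatable/requested metrics):

```python
def node_buffer_ok(node_allocatable: float,
                   cluster_allocatable: float,
                   cluster_requested: float) -> bool:
    """True if at least one node's worth of a resource (CPU or memory)
    remains unrequested across the cluster."""
    available = cluster_allocatable - cluster_requested
    return available >= node_allocatable


# Illustrative: 3 nodes of 4 cores each (12 allocatable), 7 cores requested
# -> 5 cores free, which is more than one node's worth (4), so the buffer holds.
ok = node_buffer_ok(node_allocatable=4.0,
                    cluster_allocatable=12.0,
                    cluster_requested=7.0)
```

The same check is run once for CPU and once for memory, matching the two SLIs in the table.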
Data Plane Networking
- As the tenant of the platform I want to make sure that my application can reach services in the on-prem DC, and 3rd parties over the internet.
| Category | SLI | SLO |
|----------|-----|-----|
| Availability | The proportion of successful HTTP requests to the on-prem web service made by the blackbox health check probes, measured from the metrics endpoint provided on the probes. Any response that comes back from the on-prem web service is considered successful. | x% over a 30 day rolling window |
| Availability | The proportion of successful HTTP requests to the internet service made by the blackbox health check probes, measured from the metrics endpoint provided on the probes. Any response that comes back from the internet service is considered successful. | x% over a 30 day rolling window |
| Availability | The proportion of successful HTTP requests to <INSERT_AN_ON_PREM_COPONENT_HERE> made by the blackbox health check probes, measured from the metrics endpoint provided on the probes. Any response that comes back from Vault is considered successful. | x% over a 30 day rolling window |
- Ideally we run an HTTP(S) request against an FQDN, so we exercise both TCP and UDP (DNS resolution) end to end. The downside is that external outages outside our control may affect the QoS we measure internally as a platform. From an implementation perspective this can be achieved with the Prometheus Blackbox Exporter.
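To make the probe semantics above explicit - any response counts as success, and only a transport-level failure counts against the SLI - here is a sketch with a pluggable fetch function standing in for a real HTTP client. The function names and URLs are illustrative, not the exporter’s actual interface:

```python
from typing import Callable, List


def run_probe(fetch: Callable[[str], int], url: str) -> bool:
    """One blackbox-style probe. `fetch` would be a real HTTP client in
    production (e.g. returning the response status code); any response at
    all is a success, and only a raised transport error is a failure."""
    try:
        fetch(url)
        return True
    except Exception:
        return False


def success_proportion(results: List[bool]) -> float:
    """The availability SLI over a window of probe results."""
    return sum(results) / len(results)


# Illustrative use with stand-in fetch functions (no real network calls):
ok = run_probe(lambda url: 200, "https://service.onprem.example/health")


def unreachable(url: str) -> int:
    raise OSError("connection refused")


bad = run_probe(unreachable, "https://service.onprem.example/health")
```

Treating any response as success keeps this SLI about network reachability only; the health of the remote service itself is a separate concern.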
- As the tenant of the platform I want my application to be reliably accessible via the internet through the platform.$CORP.com domain name, and from the on-prem DC securely via HTTPS through a load balancer. (from an availability perspective)
- As the tenant of the platform I want the load balancer used by my application to respond with acceptable latency, to ensure the responsiveness of my application. (from a latency perspective)
- As the tenant of the platform if I use the managed certificate from the load balancer, I want to make sure the certificate is valid and not expired. (from availability perspective)
| Category | SLI | SLO |
|----------|-----|-----|
| Availability | The proportion of successful HTTP requests to the internal / external load balancer’s ping path. | x% over a 30 day rolling window |
| Latency | The proportion of sufficiently fast requests hitting the ping path. | 95% of requests <= 100ms over a 30 day rolling window |
| Availability | Expiry date of the TLS certificate for both the internal and external load balancer. The expiry date is defined by the certificate. | X days before the certificate expires |
- The ping path must have a backend service deployed to ensure the HTTP requests are tested end to end.
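The certificate SLI above boils down to comparing the certificate’s expiry timestamp against a threshold. A sketch, where the 14-day threshold and the dates are illustrative values rather than our agreed X:

```python
from datetime import datetime, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Days remaining before the certificate's expiry timestamp."""
    return (not_after - now).total_seconds() / 86400


def cert_alert(not_after: datetime, now: datetime,
               threshold_days: int = 14) -> bool:
    """Fire when fewer than `threshold_days` remain before expiry."""
    return days_until_expiry(not_after, now) < threshold_days


# Illustrative dates: "now" fixed so the example is deterministic.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
expiring = cert_alert(datetime(2024, 1, 10, tzinfo=timezone.utc), now)  # 9 days left
healthy = cert_alert(datetime(2024, 6, 1, tzinfo=timezone.utc), now)    # months left
```

Because this alert is a slow-moving deadline rather than a symptom, it is actionable days in advance - exactly the kind of alert that should never page at 3am.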
- As the tenant of the platform I want to be able to have observability of the application from the Argus platform (from availability perspective).
| Category | SLI | SLO |
|----------|-----|-----|
| Availability | The proportion of time the Grafana agent is available, measured by the `up` query on the Grafana Cloud Prometheus database. | x% over a 30 day rolling window |
As it stands we primarily use SLO-based alerts. This helps us narrow the scope of our alerts to the symptoms that genuinely impact service reliability as experienced by our customers (tenants & end users in our case). By decoupling symptoms from the what and the why, traditional system monitoring alerts (CPU, memory, disk USE on a node-by-node basis) are largely eliminated. There are a few benefits to this approach:
- It helps operators pinpoint where an investigation should start (where customer experience is actually impacted) with minimal noise.
- It makes alerts more actionable, vs traditional system monitoring where CPU or OOM errors tell you nothing about whether end users are impacted, nor are they easy to act upon.
- It greatly reduces alert fatigue, as the number of alerts is reduced significantly.
- It generally surfaces issues earlier than traditional system monitoring alerts. For example, your app might look fine from the traditional system monitoring perspective, while alerts show an excessive SLO error burn rate (e.g. 10% of the error budget burned over 5 minutes, meaning the entire budget would be exhausted about 50 minutes after the burn began). After the alert you drill down and pinpoint the root cause to an OOM in the app. In this scenario traditional system alerts would have surfaced the issue much later, once the OOM had flooded the entire fleet.
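The burn-rate arithmetic in the example works out as follows: if a constant burn consumed 10% of the budget in 5 minutes, the whole budget lasts 5 / 0.10 = 50 minutes from the start of the burn. A one-line sketch:

```python
def minutes_to_budget_exhaustion(fraction_burned: float,
                                 elapsed_minutes: float) -> float:
    """If `fraction_burned` of the error budget was consumed in
    `elapsed_minutes` and the burn continues at the same constant rate,
    return the total minutes from the start of the burn until the
    budget is fully exhausted."""
    return elapsed_minutes / fraction_burned


# The example from the text: 10% burned in 5 minutes -> gone in ~50 minutes.
remaining = minutes_to_budget_exhaustion(0.10, 5.0)
```

This is what makes burn-rate alerts actionable: the alert arrives with a built-in deadline, telling the on-call how long they have before the SLO actually breaches.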
That being said, system/low-level monitoring metrics are still useful for observability and issue drill down in general. They need to be collected for observability/warning purposes, but not necessarily alerting.
As we mature the platform, the next stage is to introduce synthetic load so that we don’t wait for a platform user to do something before discovering a problem. For example:
- A continuous test exercising the tenant journey of onboarding and deploying a new application
- Continuous traffic exercising all network paths rather than waiting for tenant traffic to trigger an SLO
For this traffic we may define internal alerts rather than business-visible SLOs, as they are designed to discover problems before the business is ever affected by platform issues.