Multi-Node LLM Serving Using SIG LWS and vLLM

Author: Jingkai He | Posted on: March 24, 2025

This article provides a guide on how to serve large language models on multiple nodes running on Kubernetes.

Challenges

Large Language Model (LLM) serving is a challenging task. Namely:

  • It demands a lot of vRAM. For a modern open-weight model such as Llama 3.1 70B Instruct with 16-bit floating-point weights, it takes around 140GB of vRAM just to load the model into memory. On top of the model itself, each node needs to accommodate the KV cache for incoming requests. Given that it also needs to support a 128k-token context length, the KV cache can easily demand 35GB more vRAM (depending on whether you can put up with CPU RAM swapping or a smaller max model length). Together with other runtime overhead, this leads to roughly 140 + 35 + 40 = 215GB of vRAM usage; see the rough breakdown after this list.
  • With such a large vRAM footprint, fitting everything in a single node is impossible, since it requires 3 H100 NVLink GPUs (each with 80GB of vRAM). This leads to the requirement of gang-scheduled multi-node inference.
  • To scale the inference to accommodate more concurrent users, ideally we also want to scale the workload horizontally. However, in the context of Kubernetes, the existing deployment patterns such as Deployments and StatefulSets do not appear to be fit for purpose.
  • Loading a large language model onto Kubernetes nodes is also non-trivial. For example, a Llama 3.1 70B model alone uses 140GB of disk space, and the newest vLLM container image weighs 5GB. If we naively try to load the model onto k8s nodes from a container registry or Hugging Face, it is going to take around an hour or more. The slowness not only delays startup but also wastes a lot of compute (imagine 3 H100 GPUs sitting idle for an hour just to load the model).
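
For a rough sense of where these numbers come from, here is a back-of-the-envelope breakdown, assuming fp16 weights and the published Llama 3.1 70B architecture (80 layers, 8 KV heads, head dimension 128):

  • Model weights: ~70B parameters × 2 bytes ≈ 140GB.
  • KV cache per token: 80 layers × 8 KV heads × 128 head dim × 2 (K and V) × 2 bytes ≈ 320KB.
  • KV cache for a full 128k-token context: ~131,072 tokens × 320KB ≈ 40GB, which is in the same ballpark as the figure above.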

Introducing LWS

LWS is a set of new API primitives and controllers that sit on top of the Kubernetes API server. It extends the Kubernetes API to support the multi-node inference use case. It is currently a project governed by a Kubernetes Special Interest Group (SIG). According to the documentation:

LeaderWorkerSet: An API for deploying a group of pods as a unit of replication. It aims to address common deployment patterns of AI/ML inference workloads, especially multi-host inference workloads where the LLM will be sharded and run across multiple devices on multiple nodes. The initial design and proposal can be found at: http://bit.ly/k8s-LWS .

Here is the high-level concept of the LeaderWorkerSet:

To summarize the architecture:

  • We have a LeaderWorkerSet that comes with zero or more replicas of a worker group.
  • Each of the worker groups is a serving unit that may consist of one or more pods clustered together.
  • Each of the worker groups comes with a leader pod managed by a leader StatefulSet (sts). The leader pod is responsible for:
    • Serving as the entrypoint for incoming inference requests.
    • On initial startup, loading the model and farming out the hyperparameters to the workers (in the context of vLLM, at least).
    • Coordinating the workers to perform the actual inference.
  • Every worker group also has a StatefulSet of workers that is clustered with its leader.
  • The worker groups can be scaled up and down by changing the replicas field in the LeaderWorkerSet spec.
  • Each worker group can also be scaled “horizontally” by changing the size field, where size = leader + worker count (see the minimal sketch after this list).
  • To ensure minimum network latency (not everyone is on InfiniBand!), the solution is also topology-aware: it makes sure that the leader and workers belonging to the same worker group are scheduled onto the same topology domain (node/subnet/zone, etc.).
  • The failure handling of LWS is very much all-or-nothing: if any pod within a worker group fails, the entire worker group is killed and rescheduled from scratch.
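
To make the replicas/size relationship concrete, here is a minimal, illustrative LeaderWorkerSet sketch (the container specs are placeholders, not the full vLLM example referenced later):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                                 # number of worker groups (scale this horizontally)
  leaderWorkerTemplate:
    size: 3                                   # pods per worker group: 1 leader + 2 workers
    restartPolicy: RecreateGroupOnPodRestart  # all-or-nothing failure handling
    leaderTemplate:
      spec:
        containers:
          - name: vllm-leader
            image: vllm/vllm-openai:latest    # placeholder image
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: vllm/vllm-openai:latest    # placeholder image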

Deploying LWS Operator

Currently, the official approach is to deploy via a yolo install.sh script, which is understandable considering the project is fairly bleeding edge.

However, this is not particularly operable in a production environment. As a result, I forked the repo and “productionized” the operator’s kustomize configuration into a Helm chart - https://github.com/jingkaihe/lws/tree/main/charts/lws. I didn’t handcraft the Helm chart manually; instead, it’s automatically generated by some good old make config. All credit to the awesome helmify script.

Currently, you can pretty much have a productionized deployment via:

variable "service_name" {
  type    = string
  default = "lws-webhook-service"
}

variable "image" {
  type = object({
    repository = string
    tag        = string
  })

  default = {
    repository = "registry.k8s.io/lws/lws"
    tag        = "v0.3.0"
  }
}

resource "kubernetes_namespace" "lws_system" {
  metadata {
    name = "lws-system"
  }
}

resource "helm_release" "lws" {
  name             = "lws"
  repository       = "oci://ghcr.io/jingkaihe"
  chart            = "lws"
  version          = "0.1.0"
  namespace        = kubernetes_namespace.lws_system.metadata[0].name
  create_namespace = false
  max_history      = 3

  set {
    name  = "installCRDs"
    value = "true"
  }

  values = [
    yamlencode({
      fullnameOverride = "lws"
      manager = {
        image = {
          repository = var.image.repository
          tag        = var.image.tag
        }
        resources = {
          requests = {
            cpu    = "500m"
            memory = "512Mi"
          }
        }
      }
    }),
  ]
}

Deploying an LWS

The actual LWS CRD is pretty gruesome, so this article won’t go into too much detail on it. Here is an example of how to deploy an LWS powered by vLLM, with the distributed runtime managed by Ray - https://github.com/kubernetes-sigs/lws/tree/main/docs/examples/vllm.

Operational Caveats

Black-hole the metrics endpoint

By default, the vLLM OpenAI-compatible server exposes Prometheus metrics at the /metrics path. You probably want to black-hole external access to it. You can do this by slapping "nginx.ingress.kubernetes.io/server-snippet" = "location = /metrics { return 404; }" onto your ingress annotations. The example uses ingress-nginx, but the same idea applies to other ingress controllers.
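
If you manage the Ingress as a raw manifest rather than via Terraform, the equivalent looks roughly like the sketch below. Host, service name and port are illustrative (vLLM's OpenAI-compatible server listens on 8000 by default):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm
  annotations:
    nginx.ingress.kubernetes.io/server-snippet: |
      location = /metrics {
        return 404;
      }
spec:
  ingressClassName: nginx
  rules:
    - host: llm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-leader        # illustrative leader service name
                port:
                  number: 8000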

Lengthen the timeout for the incoming requests

Unlike conventional HTTP request/response traffic, the inference response is usually streamed back to the client in chunks. As a result, you probably want to increase the timeouts on the ingress end with something like:

    annotations = {
      "kubernetes.io/ingress.class"                             = "nginx"
      "nginx.ingress.kubernetes.io/rewrite-target"              = "/"
      "nginx.ingress.kubernetes.io/proxy-read-timeout"          = "60"
      "nginx.ingress.kubernetes.io/proxy-send-timeout"          = "60"
      "nginx.ingress.kubernetes.io/proxy-next-upstream-timeout" = "60"
    }

vLLM parameters

Here are some parameters you might want to tweak:

  • --tensor-parallel-size should be more or less equal to lws_size * number-of-gpus (per pod), and it must be a power of 2 (2, 4, 8, 16, etc.).
  • --max-model-len can be set to a smaller number if you are vRAM-poor. It results in a reduced context window but saves vRAM.
  • --enable-chunked-prefill=false can also be used if you constantly hit OOM.
  • --swap-space is pretty hard to tune. It controls how much CPU RAM (per GPU) vLLM can use as swap space for the KV cache. By default it’s 4GB; you can set it higher to relieve vRAM pressure, but do make sure you don’t run into CPU OOM. A sketch of how these flags fit into the leader container is shown below.
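
For illustration, here is roughly how these flags might fit into the leader container of an LWS-managed vLLM deployment. This is a sketch only: the real multi-node example also bootstraps the Ray cluster before starting vLLM, and the model name, parallel size and resource limits are placeholders you would tune to your own hardware (here assuming 2 pods × 4 GPUs = tensor parallel size of 8).

containers:
  - name: vllm-leader
    image: vllm/vllm-openai:latest
    command:
      - sh
      - -c
      - >
        vllm serve meta-llama/Llama-3.1-70B-Instruct
        --tensor-parallel-size 8
        --max-model-len 32768
        --swap-space 8
        --gpu-memory-utilization 0.9
    resources:
      limits:
        nvidia.com/gpu: "4"   # GPUs per pod; the group as a whole provides 8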

Observability

The Prometheus metrics are available out of the box for scraping. https://github.com/vllm-project/vllm/blob/main/examples/production_monitoring/grafana.json should also give you some ideas on how to visualize the data.

Additionally, you probably also want to measure GPU usage. If you are using NVIDIA GPUs, you can use the dcgm-exporter. That being said, which metrics can be scraped is a little bit obscure. Personally, I ended up with a combination of kubectl port-forwarding the dcgm-exporter DaemonSet endpoint, reverse engineering the metrics, and this doc in the mix.

One thing you will notice is that the GPU memory usage of the vLLM nodes always stays at around 90 percent. This is because --gpu-memory-utilization is set to 0.9 by default, and the model weights and KV cache are pretty much preloaded. It works somewhat like the JVM’s -Xmx and -Xms, where the memory is pre-allocated.

Big artifact loading

As this article suggested earlier, naively loading the models and vLLM container images from the Hugging Face hub or a container registry is pretty much a no-go.

Here we highlight a few techniques you can apply to speed up the loading process.

GKE Image Streaming

In our case we use the GKE image streaming feature to speed up vLLM container image loading. According to this doc, this is how it works behind the scenes:

With Image streaming, GKE uses a remote filesystem as the root filesystem for any containers that use eligible container images. GKE streams image data from the remote filesystem as needed by your workloads. Without Image streaming, GKE downloads the entire container image onto each node and uses it as the root filesystem for your workloads.

While streaming the image data, GKE downloads the entire container image onto the local disk in the background and caches it. GKE then serves future data read requests from the cached image.

When you deploy workloads that need to read specific files in the container image, the Image streaming backend serves only those requested files.

LLM model loading

The biggest loading overhead comes from loading the LLM model weights into the container itself. We managed to speed up the process as follows:

  • Pre-provision a PVC with a generous size that can at least fit the model weights. In practice we managed to fit the 70B model into a 200GB PVC, which provides 1GB/s+ model loading speed into CPU RAM.
  • Run a Kubernetes Job with the PVC mounted to the job container, download the model weights into the mounted directory, and take a snapshot of the volume at the end. The process looks like this:

pip install -U "huggingface_hub[cli]" "transformers[torch]";
apt-get update && apt-get install -y curl;
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl";
install kubectl /usr/local/bin/;

huggingface-cli download $${MODEL_NAME} --exclude "original/*";

python -c "from transformers.utils.hub import move_cache; move_cache()";

cat <<EOF | kubectl apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: ${each.key}-snapshot-$(date +%s)
spec:
  volumeSnapshotClassName: llm-snapshot
  source:
    persistentVolumeClaimName: llm-snapshotter-${each.key}
EOF

Note that you absolutely need the python -c "from transformers.utils.hub import move_cache; move_cache()" step; otherwise the cache migration happens at model-serving time, which causes an error because the mounted volume is not writable.

  • When you want to serve the model, you restore the snapshot into a PVC with storage_class_name: standard-rwo and access_modes: ReadOnlyMany, so that it can be mounted by multiple pods. Note that the provisioned IO throughput will be shared by all the pods, so you might want to give it a generous size depending on the size of your fleet (a sketch of such a PVC follows this list).
  • Voilà! In your LWS resource definition you can now reference the PVC and load the model directly from disk, without needing to download it over the wire.
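
For reference, restoring the snapshot into a read-only, multi-attach PVC looks roughly like this (names and size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-weights
spec:
  storageClassName: standard-rwo
  accessModes:
    - ReadOnlyMany              # lets every leader/worker pod mount the same weights
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: llama-70b-snapshot    # the VolumeSnapshot produced by the job above
  resources:
    requests:
      storage: 200Gi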

There are also a few options that were considered but not chosen in the end:

  • A FUSE mount with GCS/S3 as the backend. In practice we noticed it is painfully slow, with approx. 100MB/s read speed against GCS. In theory, it could be improved with Google Private Service Connect, but we have not verified it.
  • NFS-based solutions such as AWS EFS or Google Filestore. These were not picked because 1) NFS downtime would have a knock-on effect on the entire system, and 2) cloud-based NFS solutions generally come with a hefty cloud bill, especially since IOPS is often tied to the provisioned volume size.

Via this approach we managed to reduce the Llama 3 70B load time from 60 mins (via GCS) to 3 mins, measured from the GPU node being ready to the vLLM cluster being entirely up and running.

This guide from Anyscale reports an even more impressive result; however, it is not applicable to us since we are not using Anyscale Endpoints.

Final Thoughts

In this article we showed that with SIG LWS it is possible to serve LLMs in a multi-node manner.

Self-hosting LLMs is an appealing option when:

  • Proprietary solutions such as OpenAI or Anthropic cannot be used due to privacy, data sensitivity or regulatory concerns.
  • Your LLM inference usage has reached a critical mass where running your own model is more cost-effective, in terms of cost per 1,000 tokens, than using third-party inference as a service.

That being said, it also comes with a few downsides, namely:

  • Complexities in terms of infrastructure and multi-pod gang scheduling on top of Kubernetes.
  • Unless your usage reaches a certain critical mass, the cost per token is orders of magnitude higher than using third-party inference services. You can mitigate this by scaling down to zero, but that is a significant sacrifice in terms of “first token latency”, depending on the type of workload and whether it is tolerable for the end customer.
  • You are likely to be outcompeted by hyper-specialized third-party services if managing LLMs is not your core competency.
  • The model you are serving is likely to be inferior to OpenAI’s.