Reducing Downtime During LangSmith Helm Upgrades

Last updated: April 8, 2026

When upgrading self-hosted LangSmith via helm upgrade, you may experience extended unavailability due to:

  1. Container image pulls during the rollout (especially with private registries or air-gapped environments)

  2. All pods being terminated before new ones are ready

  3. Database migrations running sequentially across large version jumps

This guide covers three strategies to minimize or eliminate downtime during upgrades.

Prerequisites

  • kubectl access to your cluster

  • Your current values.yaml

  • The target chart version (check the AppVersion with helm show chart langchain/langsmith --version <version>)

1. Enable PodDisruptionBudgets

PodDisruptionBudgets (PDBs) prevent Kubernetes from terminating all pods of a service at once during a rollout. Add the following to your values.yaml:

backend:
  pdb:
    enabled: true
    minAvailable: 1
frontend:
  pdb:
    enabled: true
    minAvailable: 1
queue:
  pdb:
    enabled: true
    minAvailable: 1
ingestQueue:
  pdb:
    enabled: true
    minAvailable: 1
platformBackend:
  pdb:
    enabled: true
    minAvailable: 1
aceBackend:
  pdb:
    enabled: true
    minAvailable: 1

Note: PDBs require replicas >= 2 for each service. If a service has only 1 replica, the PDB will block the rollout since Kubernetes cannot maintain minAvailable: 1 while replacing the single pod.
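If a service currently runs a single replica, raise its replica count in the same change that enables the PDB. A minimal values.yaml sketch, assuming the chart's standard replicas key:

```yaml
backend:
  replicas: 2        # at least 2, so the PDB can keep one pod available during the rollout
  pdb:
    enabled: true
    minAvailable: 1
```

Repeat the same pattern for each service you protect with a PDB.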

2. Pre-Pull Images Before Upgrading

Image downloads can account for a significant portion of upgrade time. You can eliminate this by caching all new images on every node before running helm upgrade.

Image tags

Most LangSmith services share the same image tag (the chart's AppVersion). The full breakdown:

  • All LangSmith services (backend, frontend, go-backend, ace-backend, playground, clio, polly, agent-builder-*): the chart's AppVersion (same tag for all)

  • langgraph-operator: independent tag (see images.operatorImage.tag in your values.yaml)

  • postgres, redis, clickhouse, quickwit: pinned tags in values.yaml (typically unchanged between upgrades)
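To confirm which tags are currently running (and therefore which images the upgrade will actually pull), you can list the unique container images in the namespace. This is a read-only check against your cluster:

```shell
# List every distinct container image currently deployed in the namespace.
kubectl get pods -n <namespace> \
  -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
  | sort -u
```

Any image whose tag differs from the target version is one the nodes will need to download during the rollout.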

DaemonSet template

Deploy a DaemonSet that runs an init container for each image. Adapt the registry and image paths to match your values.yaml:

# prepull-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: langsmith-image-prepull
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: langsmith-prepull
  template:
    metadata:
      labels:
        app: langsmith-prepull
    spec:
      imagePullSecrets:
        - name: <pull-secret>   # must match your imagePullSecrets
      initContainers:
        - name: pull-backend
          image: <registry>/langchain/langsmith-backend:<NEW_VERSION>
          command: ["true"]
        - name: pull-frontend
          image: <registry>/langchain/langsmith-frontend:<NEW_VERSION>
          command: ["true"]
        - name: pull-go-backend
          image: <registry>/langchain/langsmith-go-backend:<NEW_VERSION>
          command: ["true"]
        - name: pull-ace-backend
          image: <registry>/langchain/langsmith-ace-backend:<NEW_VERSION>
          command: ["true"]
        - name: pull-playground
          image: <registry>/langchain/langsmith-playground:<NEW_VERSION>
          command: ["true"]
        - name: pull-clio
          image: <registry>/langchain/langsmith-clio:<NEW_VERSION>
          command: ["true"]
        - name: pull-polly
          image: <registry>/langchain/langsmith-polly:<NEW_VERSION>
          command: ["true"]
        - name: pull-langserve-backend
          image: <registry>/langchain/hosted-langserve-backend:<NEW_VERSION>
          command: ["true"]
        - name: pull-tool-server
          image: <registry>/langchain/agent-builder-tool-server:<NEW_VERSION>
          command: ["true"]
        - name: pull-trigger-server
          image: <registry>/langchain/agent-builder-trigger-server:<NEW_VERSION>
          command: ["true"]
        - name: pull-deep-agent
          image: <registry>/langchain/agent-builder-deep-agent:<NEW_VERSION>
          command: ["true"]
        - name: pull-operator
          image: <registry>/langchain/langgraph-operator:<OPERATOR_VERSION>
          command: ["/manager"]
      containers:
        - name: pause
          image: <registry>/library/redis:7   # or any lightweight image already cached
          command: ["sleep", "infinity"]

Note on the operator image: The langgraph-operator image is a minimal Go binary and does not include /bin/sh or true. Use command: ["/manager"] for this container. If any other image also fails with command: ["true"], try command: ["/bin/sh", "-c", "exit 0"] instead.

Running the pre-pull

# 1. Deploy the DaemonSet and wait for all images to be cached on every node
kubectl apply -f prepull-daemonset.yaml
kubectl rollout status daemonset/langsmith-image-prepull -n <namespace> --timeout=600s

# 2. Once all pods are Running, proceed with the upgrade
helm upgrade <release> langchain/langsmith --version <new-version> --values values.yaml

# 3. Clean up the pre-pull DaemonSet
kubectl delete daemonset langsmith-image-prepull -n <namespace>

You can reuse this DaemonSet for every upgrade by updating the image tags.
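Hand-editing a dozen <NEW_VERSION> placeholders each upgrade is error-prone. A small POSIX shell sketch can generate the shared-tag init-container entries instead (the function name, registry, and tag values here are illustrative, not part of the chart):

```shell
# Emit one pre-pull initContainer entry for a given image.
# $1 = image name, $2 = registry prefix, $3 = tag
emit_pull_container() {
  printf '        - name: pull-%s\n' "$1"
  printf '          image: %s/langchain/%s:%s\n' "$2" "$1" "$3"
  printf '          command: ["true"]\n'
}

# Example: regenerate the entries for the services that share the AppVersion tag.
REGISTRY="registry.example.com"   # assumption: replace with your registry
NEW_VERSION="0.12.0"              # assumption: replace with the target AppVersion
for img in langsmith-backend langsmith-frontend langsmith-go-backend \
           langsmith-ace-backend langsmith-playground langsmith-clio \
           langsmith-polly; do
  emit_pull_container "$img" "$REGISTRY" "$NEW_VERSION"
done
```

Paste the output into the initContainers section of the manifest, keeping the operator entry (with its independent tag and /manager command) separate.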

3. Monitor Migrations During Upgrade

Database migrations run as Kubernetes Jobs and are typically the main source of remaining downtime after image pulls are eliminated. Monitor their progress:

kubectl get jobs -n <namespace> -w
kubectl logs job/<release>-langsmith-backend-migrations -n <namespace> -f
kubectl logs job/<release>-langsmith-backend-ch-migrations -n <namespace> -f
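If you are scripting the upgrade, kubectl wait can block until both migration Jobs finish rather than watching logs interactively (assumption: the Job names follow the <release>-langsmith-* pattern shown above; adjust to your release name):

```shell
# Block until both migration Jobs report completion, or fail after 30 minutes.
kubectl wait --for=condition=complete \
  job/<release>-langsmith-backend-migrations \
  job/<release>-langsmith-backend-ch-migrations \
  -n <namespace> --timeout=30m
```

A non-zero exit from kubectl wait means the timeout expired or a Job failed, which is your cue to inspect the Job logs before proceeding.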

Additional Recommendations

  • Upgrade in smaller increments: Jump 2-3 minor versions at a time rather than spanning many versions in a single upgrade. Each jump runs fewer migrations, reducing both the migration window and rollback risk.

  • Check your ClickHouse pullPolicy: If set to Always, the image will be re-downloaded even if it is already cached. Consider using IfNotPresent with a pinned version tag, or include the ClickHouse image in the pre-pull DaemonSet.

  • Test in a staging environment first: If possible, run the upgrade on a non-production cluster to measure migration duration and catch issues before upgrading production.
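The ClickHouse recommendation above might look like the following in values.yaml. This is a sketch: the exact key path under clickhouse varies by chart version, so check your own values.yaml for where the image settings live.

```yaml
clickhouse:
  image:
    pullPolicy: IfNotPresent   # skip re-downloading an already-cached image
    tag: "24.8"                # hypothetical pinned version; keep whatever you run today
```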